Zakira.Replay 0.8.0

Zakira.Replay

Let LLMs and AI agents "watch" video.

LLMs cannot ingest video natively. Zakira.Replay turns any video source — YouTube URL, conference recording, course lecture, meeting capture, local .mp4 — into the durable, timestamped artifacts an agent can actually reason over. Instead of pretending it watched a 90-minute talk, the agent quotes specific moments with timecodes from artifacts on disk.

The pipeline produces three complementary views of the same video so an agent can pick whichever fits the question:

  • Transcripts — speaker-attributed text from existing captions (via yt-dlp for URLs, sidecar .vtt/.srt for local files) when available, or local Whisper STT (--llm-provider local-whisper) when no captions exist. Silence-aware chunking handles long-form audio without hitting per-request limits. Optional local speaker diarization (--diarize, sherpa-onnx + pyannote + 3D-Speaker) attributes audio when caption tags don't.
  • Vision — representative frames extracted at ffmpeg scene-change boundaries, then routed through OCR (local RapidOCR PP-OCRv5 by default, or LLM-routed) and structured vision analysis (local CLIP / Florence-2 on-device, or GitHub Copilot SDK / OpenAI / Azure OpenAI / Ollama). Perceptual-hash slide grouping runs OCR and vision once per unique on-screen slide and records first/last visible timestamps as facts.
  • Structured evidence — every run lands in runs/<source-slug>-<sha8>/ with manifest.json (pipeline timings + dependency snapshot), evidence.json (timestamped facts), transcript.md, extracted frames, slides, per-speaker registry, optional chapter index, and structured warnings. Schema-versioned and machine-readable so agents can quote with confidence.

One .NET 10 binary ships two surfaces for the same pipeline: a zakira-replay CLI (drive from shell scripts or dnx) and an MCP server (zakira-replay mcp serve) so any MCP-aware agent can analyze video as a first-class tool. Local providers run fully on-device when you want air-gapped operation; cloud LLM providers plug in via Microsoft.Extensions.AI.IChatClient when you want stronger summarization or vision quality.

Part of the Zakira project family.

System binaries (yt-dlp, ffmpeg, ffprobe) and the search-embedding model are opt-in: install them upfront with zakira-replay deps install, or set dependencies.autoDownload=true once and let Zakira.Replay fetch them on first need. Local providers (OCR, Whisper STT, diarization, local vision) auto-download their models on first use by default — opt out per-provider if you want strict offline control. Missing dependencies always fail with a clear, actionable error.

Install

Zakira.Replay ships as a regular .NET global tool on NuGet. Requires the .NET 10 SDK (or runtime).

# Install once, run anywhere (recommended for repeated use):
dotnet tool install -g Zakira.Replay
zakira-replay version

# Or run one-off without installing (.NET 10 SDK only — no install, no cleanup):
dnx Zakira.Replay version

Both invocations expose the same zakira-replay command surface documented below.

Getting Started

Confirm the tool launches and inspect environment readiness:

zakira-replay version
zakira-replay doctor          # one line per dependency: found / missing / version
zakira-replay info --json     # machine-readable resolved-paths + capability flags

Install the OS-level binaries Zakira.Replay relies on. They are not bundled; deps install fetches portable copies into the configured portable directory (override with dependencies.portableDirectory or ZAKIRA_REPLAY_PORTABLE_DIRECTORY):

# Media essentials (ffmpeg + ffprobe + yt-dlp). Required for almost everything.
zakira-replay deps install media

# Optional local-only providers — install only what you plan to use:
zakira-replay deps install ocr              # RapidOCR PP-OCRv5 latin pack (~30 MB)
zakira-replay deps install whisper-model    # local Whisper ggml model for --llm-provider local-whisper
zakira-replay deps install diarization      # pyannote-segmentation + 3D-Speaker ONNX (for --diarize)
zakira-replay deps install onnx             # all-MiniLM-L6-v2 search embedding model (for sqlite-onnx)

# Or install everything at once:
zakira-replay deps install all

Or: enable on-demand auto-download

If you'd rather have Zakira.Replay fetch system tools the moment they're needed (instead of running deps install upfront), flip one flag:

zakira-replay config set dependencies.autoDownload true

After that, the first invocation that needs yt-dlp auto-fetches it from the official GitHub release (Windows, Linux x64, Linux ARM64, macOS).

Two caveats:

  • ffmpeg / ffprobe portable auto-download is Windows-x64 only. On Linux/macOS install through your package manager (apt install ffmpeg, brew install ffmpeg, etc.).
  • The ONNX search embedding model is separately opt-in: zakira-replay config set search.onnx.autoDownload true to enable for the sqlite-onnx search backend.

Local providers (OCR, Whisper STT, diarization, local vision) already auto-download their models on first use by default — no extra flag required.

Run your first analysis. Output lands under runs/<source-slug>-<sha8>/ (the run-id is deterministic per source so --cache reuse "just works"):

zakira-replay analyze "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

Inspect what came out:

ls runs/                              # find the generated run folder
cat runs/<run-id>/manifest.json       # pipeline timings, dependency snapshot, artifact index
cat runs/<run-id>/transcript.md       # speaker-attributed transcript
cat runs/<run-id>/evidence.json       # structured timestamped facts for LLM agents

For one-off ad-hoc operations without a full analyze run, see zakira-replay frames, clip, transcribe, and search further down. For agent integration, see MCP Jobs and Agent Skills. For default behaviour, override flags, and per-stage configuration, continue with Commands, Defaults, and Dependency Configuration.

Commands

zakira-replay doctor [--json]
zakira-replay info [--json]
zakira-replay version
zakira-replay analyze <url-or-file> [--vision-instruction <text>] [--ocr-instruction <text>] [--frames <count>] [--frames-per-minute <n>] [--frame-strategy interval|scene|every-frame] [--scene-safety-cap <n>] [--llm-provider github-copilot|openai|azure-openai|ollama|local-whisper] [--ocr-provider copilot|local] [--smart-crop] [--smart-crop-profile auto|teams|zoom|webex|generic|off] [--capture-mode auto|ytdlp|browser] [--auth-profile <name>] [--stt] [--ocr] [--vision] [--diarize] [--num-speakers <n>] [--diarize-threshold <0.0-1.0>] [--caption-languages <list>] [--no-slide-grouping] [--slide-hash-distance <n>] [--run-id <id>] [--cache] [--force]
zakira-replay transcribe <url-or-file> [--stt] [--audio] [--run-id <id>] [--cache] [--force]
zakira-replay frames <url-or-file> [--at <ts1,ts2,...> | --from <ts> --to <ts> [--count <n>] [--strategy interval|scene]] [--max-edge <px>] [--quality <1-100>] [--phash] [--scene-safety-cap <n>] [--run-id <id>] [--json]
zakira-replay clip <url-or-file> --start <timestamp> --end <timestamp> [--run-id <id>] [--output-name <name>]
zakira-replay search build <run-directory> [--backend json|sqlite|sqlite-onnx]
zakira-replay search query <run-directory-or-index> <query> [--top <n>] [--backend auto|json|sqlite|sqlite-onnx]
zakira-replay chapters build <run-directory> [--min-duration <seconds>] [--max-duration <seconds>]
zakira-replay align <run-directory>
zakira-replay discover <url> [--browser] [--output <path>]
zakira-replay batch run <manifest.json>
zakira-replay queue enqueue <url-or-file> [analysis options] [--queue-id <id>] [--job-id <id>] [--retries <n>]
zakira-replay queue run [--queue-id <id>] [--concurrency <n>] [--retries <n>]
zakira-replay queue status [--queue-id <id>] [--json]
zakira-replay deps install [yt-dlp|ffmpeg|ffprobe|onnx|ocr|whisper-model|diarization|media|all] [--whisper-model tiny|base|small|medium|large-v3|large-v3-turbo] [--force]
zakira-replay deps path
zakira-replay auth login <profile-name> [--url <start-url>]
zakira-replay auth init-edge-profile [--url <start-url>] [--user-data-dir <path>] [--profile-directory <name>]
zakira-replay auth list
zakira-replay auth show <profile-name>
zakira-replay auth clear <profile-name>
zakira-replay auth path [profile-name]
zakira-replay config <path|list|get|set> ...
zakira-replay mcp serve

Defaults

Out of the box, zakira-replay analyze <url> runs with these defaults:

| Knob | Default | Override |
|---|---|---|
| Frame strategy | scene (ffmpeg scene-change boundaries) | --frame-strategy interval\|every-frame |
| Frame count (interval/every-frame only) | 500 | --frames <n> |
| Frames per minute (interval-strategy floor) | 12 (frames.perMinute config) | --frames-per-minute <n>; pass 0 to disable scaling |
| Scene safety cap | 5000 scene frames | --scene-safety-cap <n> or frames.sceneSafetyCap config |
| Max AI frames (OCR/vision per slide cap) | 50 | --max-ai-frames <n> |
| OCR provider | local (RapidOCR via ONNX, no LLM, no network) | --ocr-provider copilot to route through an LLM |
| OCR model auto-download | on (ocr.local.autoDownload=true); the first OCR run silently fetches ~30 MB of PP-OCRv5 latin models from ModelScope | zakira-replay config set ocr.local.autoDownload false, or pre-install with deps install ocr |
| Run ID (when --run-id omitted) | deterministic: <source-slug>-<sha8> (e.g. https-www-youtube-com-watch-v-abc-a3f9c2e1). Same source URL always lands in the same run folder, so --cache reuse works without an explicit run-id | --run-id <name> to pin |
| Smart-crop | off | --smart-crop |
| Capture mode | ytdlp (yt-dlp + ffmpeg) | --capture-mode browser\|auto |

The cache key (runs/.cache/<sha256>.json) is computed from the full request shape — OcrProvider, SmartCrop, SmartCropProfile, CaptureMode, and AuthProfile are part of it, so flipping any of these correctly invalidates prior cached runs.
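As a rough illustration of how a deterministic run ID and a request-shaped cache key behave, here is a minimal Python sketch. It is not Zakira.Replay's implementation: the slug rule, the choice of SHA-256 over the raw source string for the sha8 suffix, the canonical-JSON encoding, and the request field names are all assumptions made for the example.

```python
import hashlib
import json
import re


def run_id(source: str) -> str:
    """Build a '<source-slug>-<sha8>' name that is stable for a given source."""
    # Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", source.lower()).strip("-")
    # Assumed hash input: the raw source string.
    sha8 = hashlib.sha256(source.encode("utf-8")).hexdigest()[:8]
    return f"{slug}-{sha8}"


def cache_key(request: dict) -> str:
    """Hash a canonical JSON form so that changing any field changes the key."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical request shape; the field names are illustrative only.
base = {
    "source": "https://example.com/some-video",
    "ocrProvider": "local",
    "smartCrop": False,
    "smartCropProfile": "auto",
    "captureMode": "ytdlp",
    "authProfile": None,
}
flipped = {**base, "smartCrop": True}

print(run_id(base["source"]))
print(cache_key(base) == cache_key(flipped))  # False: flipping a field invalidates the key
```

Because the key covers the whole request, a cached result is only reused when every option that could change the output is identical.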

Dependency Configuration

Zakira.Replay resolves dependency paths in this order:

  • Environment variable override.
  • User config file.
  • Portable dependency directory.
  • PATH or known install locations.
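The lookup order above can be sketched in Python as a chain of fallbacks. This is only an illustration under stated assumptions: the function name and arguments are hypothetical, the portable directory layout is guessed, and the "known install locations" step is left out.

```python
import shutil
from pathlib import Path
from typing import Mapping, Optional


def resolve_tool(name: str, env_var: str, env: Mapping[str, str],
                 config: Mapping[str, str],
                 portable_dir: Optional[Path]) -> Optional[str]:
    """Return the first path found, checking sources in priority order."""
    # 1. Environment variable override, e.g. ZAKIRA_REPLAY_YTDLP_PATH.
    override = env.get(env_var)
    if override:
        return override
    # 2. User config file, e.g. the yt-dlp.path key.
    configured = config.get(f"{name}.path")
    if configured:
        return configured
    # 3. Portable dependency directory (flat layout assumed for this sketch).
    if portable_dir is not None:
        for candidate in (portable_dir / name, portable_dir / f"{name}.exe"):
            if candidate.is_file():
                return str(candidate)
    # 4. PATH lookup; known install locations are not modelled here.
    return shutil.which(name, path=env.get("PATH"))


print(resolve_tool(
    "yt-dlp", "ZAKIRA_REPLAY_YTDLP_PATH",
    env={"ZAKIRA_REPLAY_YTDLP_PATH": "/opt/tools/yt-dlp"},
    config={"yt-dlp.path": "/home/me/bin/yt-dlp"},
    portable_dir=None,
))  # the environment override wins over the config entry
```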

Portable installs are opt-in. Each deps install target fetches one set of files:

  • zakira-replay deps install media installs portable yt-dlp, ffmpeg, and ffprobe into the configured portable directory.
  • zakira-replay deps install onnx downloads the ONNX search model files.
  • zakira-replay deps install ocr [--language <pack>] downloads a RapidOCR PP-OCRv5 language pack for the local OCR provider (default latin; other packs: chinese, english, korean, cyrillic, arabic, devanagari, greek, telugu, tamil).
  • zakira-replay deps install whisper-model [--whisper-model <size>] downloads a Whisper ggml model for the --llm-provider local-whisper STT path.
  • zakira-replay deps install diarization downloads the pyannote-segmentation-3.0 and 3D-Speaker ONNX models used by the --diarize flag.

With no target, zakira-replay deps install defaults to media; use all to install media tools, ONNX search models, the configured OCR language pack, the default Whisper model, and the diarization models.

Auto-download flags

Zakira.Replay has six independent autoDownload flags. System tools and the search model are opt-in (default false); local-provider models are opt-out (default true). Each flag triggers fetch-on-first-use without requiring an upfront deps install run:

| Flag | Default | Triggers when |
|---|---|---|
| dependencies.autoDownload | false | yt-dlp / ffmpeg / ffprobe are needed and missing (portable ffmpeg is Windows-x64 only) |
| search.onnx.autoDownload | false | The sqlite-onnx search backend needs the all-MiniLM-L6-v2 embedding model |
| ocr.local.autoDownload | true | A local OCR run needs the RapidOCR PP-OCRv5 language pack |
| llm.localWhisper.autoDownload | true | --llm-provider local-whisper needs a Whisper ggml model |
| diarization.autoDownload | true | --diarize needs the sherpa-onnx (pyannote + 3D-Speaker) models |
| vision.local.autoDownload | true | --vision-provider local needs CLIP / Florence ONNX models |

Set any flag with zakira-replay config set <flag> <true|false>. To go strictly offline, flip the four local-provider flags to false and pre-install everything with zakira-replay deps install all.

The global config path is resolved in this order:

  • ZAKIRA_REPLAY_CONFIG_PATH.
  • $XDG_CONFIG_HOME/Zakira.Replay/Zakira.Replay.json when XDG_CONFIG_HOME is set.
  • The platform app-data fallback, such as %APPDATA%\Zakira.Replay\Zakira.Replay.json on Windows.

For compatibility, on first load Zakira.Replay performs a one-time migration: if a legacy VideoWatcher\VideoWatcher.json (or VideoWatcher.config / config.json) sits next to the new Zakira.Replay directory under your config root, its contents are copied to Zakira.Replay\Zakira.Replay.json and the legacy file is removed.

Environment variables:

  • ZAKIRA_REPLAY_YTDLP_PATH
  • ZAKIRA_REPLAY_FFMPEG_PATH
  • ZAKIRA_REPLAY_FFPROBE_PATH
  • ZAKIRA_REPLAY_EDGE_PATH
  • ZAKIRA_REPLAY_PORTABLE_DIRECTORY
  • ZAKIRA_REPLAY_ONNX_MODEL_PATH
  • ZAKIRA_REPLAY_ONNX_VOCAB_PATH
  • ZAKIRA_REPLAY_ONNX_MODEL_DIRECTORY
  • ZAKIRA_REPLAY_ONNX_MODEL_FILE
  • ZAKIRA_REPLAY_ONNX_MAX_SEQUENCE_LENGTH
  • ZAKIRA_REPLAY_ONNX_EMBEDDING_DIMENSIONS
  • ZAKIRA_REPLAY_OCR_PROVIDER
  • ZAKIRA_REPLAY_OCR_LANGUAGE_PACK
  • ZAKIRA_REPLAY_OCR_MODEL_DIRECTORY
  • ZAKIRA_REPLAY_OCR_DETECTION_MODEL_PATH
  • ZAKIRA_REPLAY_OCR_CLASSIFICATION_MODEL_PATH
  • ZAKIRA_REPLAY_OCR_RECOGNITION_MODEL_PATH
  • ZAKIRA_REPLAY_OCR_DICTIONARY_PATH
  • ZAKIRA_REPLAY_AUTH_DIRECTORY
  • ZAKIRA_REPLAY_EDGE_USER_DATA_DIR
  • ZAKIRA_REPLAY_LLM_PROVIDER
  • ZAKIRA_REPLAY_OLLAMA_ENDPOINT
  • ZAKIRA_REPLAY_OLLAMA_MODEL
  • ZAKIRA_REPLAY_OLLAMA_VISION_MODEL
  • OLLAMA_HOST (Ollama's standard env var; honoured as a fallback for the endpoint)
  • ZAKIRA_REPLAY_WHISPER_MODEL_PATH
  • ZAKIRA_REPLAY_WHISPER_MODEL_DIRECTORY
  • ZAKIRA_REPLAY_WHISPER_MODEL_SIZE
  • ZAKIRA_REPLAY_WHISPER_LANGUAGE
  • ZAKIRA_REPLAY_WHISPER_THREADS
  • ZAKIRA_REPLAY_WHISPER_AUTODOWNLOAD
  • ZAKIRA_REPLAY_DIARIZATION_PROVIDER
  • ZAKIRA_REPLAY_DIARIZATION_MODEL_DIRECTORY
  • ZAKIRA_REPLAY_DIARIZATION_SEGMENTATION_MODEL_PATH
  • ZAKIRA_REPLAY_DIARIZATION_EMBEDDING_MODEL_PATH
  • ZAKIRA_REPLAY_DIARIZATION_NUM_SPEAKERS
  • ZAKIRA_REPLAY_DIARIZATION_THRESHOLD
  • ZAKIRA_REPLAY_DIARIZATION_MIN_DURATION_ON
  • ZAKIRA_REPLAY_DIARIZATION_MIN_DURATION_OFF
  • ZAKIRA_REPLAY_DIARIZATION_THREADS
  • ZAKIRA_REPLAY_DIARIZATION_AUTODOWNLOAD
  • HF_TOKEN (optional — used by deps install whisper-model to lift Hugging Face download rate limits)
  • OPENAI_API_KEY
  • OPENAI_BASE_URL
  • OPENAI_MODEL
  • OPENAI_TRANSCRIPTION_MODEL
  • AZURE_OPENAI_ENDPOINT
  • AZURE_OPENAI_API_KEY
  • AZURE_OPENAI_DEPLOYMENT
  • AZURE_OPENAI_MODEL
  • AZURE_OPENAI_API_VERSION

User config commands:

zakira-replay config path
zakira-replay config list
zakira-replay deps path
zakira-replay deps install media
zakira-replay deps install onnx
zakira-replay config set yt-dlp.path C:\tools\yt-dlp\yt-dlp.exe
zakira-replay config set ffmpeg.path C:\tools\ffmpeg\bin\ffmpeg.exe
zakira-replay config set dependencies.autoDownload true
zakira-replay config set dependencies.portableDirectory C:\tools\zakira-replay
zakira-replay config set search.onnx.modelPath C:\models\embedding.onnx
zakira-replay config set search.onnx.vocabularyPath C:\models\vocab.txt
zakira-replay config set search.onnx.autoDownload true
zakira-replay config set search.onnx.modelDirectory C:\models\all-MiniLM-L6-v2
zakira-replay config set llm.provider openai
zakira-replay config set llm.openai.model gpt-4o-mini
zakira-replay config set llm.openai.apiKeyEnvVars OPENAI_API_KEY,WORK_OPENAI_API_KEY
zakira-replay config set llm.azureOpenAi.endpoint https://example.openai.azure.com
zakira-replay config set llm.azureOpenAi.deployment video-analysis
zakira-replay config set llm.azureOpenAi.apiKeyEnvVars AZURE_OPENAI_API_KEY,WORK_AZURE_OPENAI_API_KEY
zakira-replay config set captions.languages auto
zakira-replay config set captions.languages fr,en,live_chat
zakira-replay config set ocr.provider local
zakira-replay config set ocr.local.modelDirectory C:\models\rapidocr
zakira-replay config set ocr.local.autoDownload true
zakira-replay config set frames.sceneSafetyCap 5000
zakira-replay config set frames.perMinute 12
zakira-replay config set crop.enabled true
zakira-replay config set crop.profile auto
zakira-replay config set capture.mode auto
zakira-replay config set capture.browser.seekWaitSeconds 3
zakira-replay config set auth.directory C:\secrets\zakira-auth
zakira-replay config set auth.staleThresholdMinutes 120
zakira-replay config get yt-dlp.path

If the value passed to config set is a directory, Zakira.Replay appends the expected executable name.

Caption Languages

Caption preferences default to ["auto"], which unions the source's primary language, the languages with manually uploaded subtitles (per yt-dlp's info.subtitles), English (en, en.*), and YouTube live-chat replay so an existing transcript is found whenever yt-dlp knows of one. YouTube auto-translation languages (those that appear only under info.automatic_captions and not under info.subtitles) are intentionally not expanded by auto, because they are translations inferred from the source rather than facts about what was spoken. To opt into a specific auto-translation, request it explicitly with --caption-languages es (CLI), captionLanguages: ["es"] (MCP/batch), or zakira-replay config set captions.languages es. The languages yt-dlp advertises for a source are written to metadata.json under availableSubtitleLanguages, with hasManual / hasAuto flags per language so orchestrators can branch on what is actually available before retrying.

Speakers

When captions carry speaker tags, Zakira.Replay extracts them as facts, not synthesis:

  • VTT voice spans <v Speaker Name>...</v> (and self-terminating <v Name> lines).
  • SRT line prefixes Speaker Name: utterance (only when the prefix shape looks like a name).
  • Bracketed prefixes [Speaker Name] utterance.

Each transcript[*] segment carries speakerId (slugified, stable) and speakerDisplayName (verbatim from the source). A per-speaker registry is written under evidence.speakers[] with segmentCount, totalSeconds, firstSeenSeconds, and lastSeenSeconds. Transcript normalization treats speaker changes as hard boundaries: two near-duplicate utterances by different speakers are kept separate. Speakers are never invented; segments without a recognisable tag carry null for both fields.
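The three tag shapes can be approximated with regular expressions, as in the Python sketch below. It is an illustration rather than the tool's parser: the "looks like a name" rule (one to four capitalised words) and the slug rule are assumptions chosen for the example.

```python
import re
from typing import Optional, Tuple

# <v Speaker Name>...</v>, with the closing tag optional (self-terminating lines).
VTT_VOICE = re.compile(r"^<v\s+([^>]+)>(.*?)(?:</v>)?$")
# [Speaker Name] utterance
BRACKETED = re.compile(r"^\[([^\]]+)\]\s*(.+)$")
# Speaker Name: utterance. Assumed name shape: one to four capitalised words.
SRT_PREFIX = re.compile(r"^([A-Z][\w.'-]*(?:\s+[A-Z][\w.'-]*){0,3}):\s+(.+)$")


def slugify(name: str) -> str:
    """Stable lowercase identifier derived from the display name (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def extract_speaker(line: str) -> Tuple[Optional[str], Optional[str], str]:
    """Return (speakerId, speakerDisplayName, utterance); ids are None when untagged."""
    text = line.strip()
    for pattern in (VTT_VOICE, BRACKETED, SRT_PREFIX):
        match = pattern.match(text)
        if match:
            display = match.group(1).strip()
            return slugify(display), display, match.group(2).strip()
    return None, None, text  # speakers are never invented


for sample in ("<v Ada Lovelace>Hello there</v>",
               "[Grace Hopper] Ship it",
               "Alan Turing: Can machines think?",
               "note: this is not a speaker"):
    print(extract_speaker(sample))
```

The last sample falls through all three patterns and keeps null speaker fields, matching the rule that untagged segments are left unattributed.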

STT-derived transcripts do not carry speakers on their own; attribution for untagged audio comes only from the optional local diarization pass (--diarize). Cloud provider-backed diarization is out of scope for this release; the schema fields are stable so a future phase can plug it in without breaking consumers.

STT Chunking

Speech-to-text on long audio is silence-chunked before each provider call to stay under per-request size limits (for example OpenAI Whisper's 25 MB cap). Audio shorter than the configured target duration is sent in one shot. When chunking actually splits the audio:

  • Boundaries snap to the centre of ffmpeg silencedetect windows nearest each target step, falling back to a hard cut when no usable silence exists.
  • Each chunk is re-encoded as 16 kHz mono PCM under audio/chunks/chunk-NNN.wav.
  • Per-chunk transcript responses have their timestamps shifted by the chunk's start offset so downstream consumers continue to see one continuous timeline.
  • An audio/chunks/chunks.json artifact records chunk metadata and detected silence windows (schema audio-chunks.schema.json).
  • Per-chunk failures are recorded as structured warnings (STT_CHUNK_FAILED) instead of failing the whole run.
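The boundary snapping and timestamp shifting described above can be sketched in Python as follows. This is a simplified model, not the shipped algorithm: it assumes target steps fall at fixed multiples of the target duration, and the tolerance parameter that decides whether a silence is "usable" is a hypothetical knob.

```python
from typing import Dict, List, Tuple


def chunk_boundaries(duration: float, target: float,
                     silences: List[Tuple[float, float]],
                     tolerance: float) -> List[float]:
    """Return cut points in seconds, snapped to silence centres where possible."""
    if duration <= target:
        return []  # short audio is sent in one shot
    if tolerance >= target / 2:
        raise ValueError("tolerance must be smaller than half the target duration")
    centres = [(start + end) / 2 for start, end in silences]
    centres = [c for c in centres if 0 < c < duration]
    cuts: List[float] = []
    k = 1
    while k * target < duration:
        step = k * target
        near = [c for c in centres if abs(c - step) <= tolerance]
        # Snap to the nearest silence centre, or hard-cut at the step itself.
        cuts.append(min(near, key=lambda c: abs(c - step)) if near else step)
        k += 1
    return cuts


def shift_segments(segments: List[Dict], offset: float) -> List[Dict]:
    """Move per-chunk timestamps back onto the full-audio timeline."""
    return [{**seg, "start": seg["start"] + offset, "end": seg["end"] + offset}
            for seg in segments]


cuts = chunk_boundaries(1500.0, 600.0, [(590.0, 594.0), (1230.0, 1240.0)], 30.0)
print(cuts)  # [592.0, 1200.0]: first cut snaps to a silence, second is a hard cut
print(shift_segments([{"start": 1.5, "end": 4.0, "text": "hi"}], cuts[0]))
```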

Slides

Frames are perceptually hashed (64-bit dHash via ffmpeg, no managed image library required) and adjacent frames within a Hamming distance threshold are grouped into slides. Slides are facts about visible-content continuity: an orchestrator can answer "when was slide X visible?" by reading firstSeenSeconds/lastSeenSeconds directly from evidence.slides[] (also written to slides/slides.json).

OCR and vision run once per slide (the slide's primaryFrameId), not per individual frame, so a 60-minute talk with 30 scene frames typically pays for far fewer than 30 LLM calls. Each OcrFrameResult and VisionFrameResult carries a slideId reference back to its slide.

Tunables:

  • slides.enabled (default true) — set false to disable grouping; every frame becomes its own slide.
  • slides.hashDistance (default 6, range 0-64) — maximum Hamming distance between adjacent dHash values still considered the same slide.
  • CLI: --no-slide-grouping and --slide-hash-distance <n>.
  • MCP: slideGrouping: false and slideHashDistance: <n>.
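To make the hashDistance tunable concrete, here is a minimal Python sketch of grouping adjacent frames by Hamming distance. It assumes each frame is compared with the frame immediately before it; the slide ID format and the frameIds field are hypothetical, while firstSeenSeconds, lastSeenSeconds, and primaryFrameId follow the names used above.

```python
from typing import Dict, List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hashes."""
    return bin(a ^ b).count("1")


def group_slides(frames: List[Tuple[str, float, int]],
                 max_distance: int = 6) -> List[Dict]:
    """frames: (frame_id, timestamp_seconds, dhash64) tuples in time order."""
    slides: List[Dict] = []
    previous_hash = None
    for frame_id, seconds, dhash in frames:
        if previous_hash is not None and hamming(previous_hash, dhash) <= max_distance:
            # Same slide is still on screen: extend its visible range.
            slides[-1]["lastSeenSeconds"] = seconds
            slides[-1]["frameIds"].append(frame_id)
        else:
            slides.append({
                "slideId": f"slide-{len(slides) + 1:04d}",  # hypothetical ID format
                "primaryFrameId": frame_id,
                "firstSeenSeconds": seconds,
                "lastSeenSeconds": seconds,
                "frameIds": [frame_id],
            })
        previous_hash = dhash
    return slides


frames = [("f1", 0.0, 0b0000), ("f2", 10.0, 0b0011), ("f3", 20.0, 0xFFFFFFFFFFFFFFFF)]
for slide in group_slides(frames):
    print(slide["slideId"], slide["firstSeenSeconds"], slide["lastSeenSeconds"])
```

Only each slide's primary frame would then be sent to OCR and vision, which is where the cost saving comes from.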

Frame Budgeting

--frames N is a per-strategy parameter, not a global density:

| Strategy | What --frames N produces |
|---|---|
| interval | exactly N frames spaced evenly across the duration |
| scene (default) | up to frames.sceneSafetyCap (default 5000) scene-cut frames; --frames is ignored. Slide grouping deduplicates the unbounded stream so OCR/vision cost still scales with unique slides only |
| every-frame | the first N decoded frames of the video (a debug/inspection tool) |

For long videos, --frames 30 with the interval strategy means one frame every duration/30 seconds, which is 80 seconds apart on a 40-minute video and likely too sparse. Two ways to densify:

  • --frames-per-minute <n> (CLI), framesPerMinute (MCP/batch). Scales the count by duration; --frames becomes the floor: effective = max(framesPerMinute * durationMinutes, --frames). Ignored for scene and every-frame.
  • --scene-safety-cap <n> (CLI), sceneSafetyCap (MCP/batch), or frames.sceneSafetyCap (config) raises the upper bound on scene-strategy extraction. The default 5000 is generous for typical talks and slide-heavy demos.

If a run looks undersampled (fewer than 1 frame per 5 minutes for the interval strategy without --frames-per-minute and with frames.perMinute=0 in config), Zakira.Replay emits a FRAMES_LIKELY_UNDERSAMPLED warning naming the actual ratio. When the scene safety cap is reached, it emits FRAMES_SCENE_CAP_REACHED. Both are facts; orchestrators can branch on the codes.
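The interval-strategy arithmetic can be checked with a short Python sketch. The formula and the 1-frame-per-5-minutes threshold come from the text above; rounding the scaled count up with ceil is an assumption, and the function names are illustrative.

```python
import math


def effective_interval_frames(frames: int, frames_per_minute: float,
                              duration_seconds: float) -> int:
    """effective = max(framesPerMinute * durationMinutes, --frames)."""
    minutes = duration_seconds / 60
    scaled = math.ceil(frames_per_minute * minutes)  # rounding mode is assumed
    return max(scaled, frames)


def likely_undersampled(frame_count: int, duration_seconds: float) -> bool:
    """True when there is less than one frame per five minutes of video."""
    return frame_count < duration_seconds / 300


forty_minutes = 40 * 60
print(effective_interval_frames(30, 12, forty_minutes))  # 480
print(effective_interval_frames(30, 0, forty_minutes))   # 30
print(forty_minutes / 30)                                # 80.0 seconds between frames
print(likely_undersampled(5, 60 * 60))                   # True
```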

Structured OCR/Vision

OCR and vision prompts ask the model to return strict JSON. Each OcrFrameResult.Structured carries { freeText, lines[], tables[] }; each VisionFrameResult.Structured carries { kind, title?, bullets[], codeBlocks[], charts[], uiElements[], freeText }. When the model returns prose instead of JSON, a tolerant fallback stores the raw text under freeText and a structured warning (OCR_PARSE_FALLBACK / VISION_PARSE_FALLBACK) is emitted so orchestrators can branch.
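A tolerant fallback of this kind might look like the following Python sketch for the OCR case. The freeText / lines / tables shape and the OCR_PARSE_FALLBACK code come from the text above; the function name and the exact validation rules are assumptions.

```python
import json
from typing import Dict, List, Tuple


def parse_ocr_response(raw: str) -> Tuple[Dict, List[str]]:
    """Return (structured result, warning codes) for one model response."""
    warnings: List[str] = []
    try:
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        structured = {
            "freeText": str(data.get("freeText", "")),
            "lines": list(data.get("lines") or []),
            "tables": list(data.get("tables") or []),
        }
    except (ValueError, TypeError):
        # The model answered in prose: keep the raw text and flag the fallback.
        structured = {"freeText": raw, "lines": [], "tables": []}
        warnings.append("OCR_PARSE_FALLBACK")
    return structured, warnings


print(parse_ocr_response('{"freeText": "Hi", "lines": ["Hi"], "tables": []}'))
print(parse_ocr_response("The slide says Hi"))
```

Either way the caller receives the same shape, so downstream code only needs to check the warning list to know whether lines and tables can be trusted.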

Per-frame artifacts are also written for direct loading without parsing evidence.json:

  • ocr/{frameId}.json (schema: ocr.schema.json)
  • vision/{frameId}.json (schema: vision.schema.json)

OCR Providers

OCR can run through one of two providers, selectable per-run with --ocr-provider:

  • copilot — routes the image through the configured LLM (GitHub Copilot, OpenAI, or Azure OpenAI) using vision-capable chat models. Produces high-quality structured OCR including the lines[] and tables[] fields when the model returns strict JSON.
  • local (default) — runs entirely on the local machine via RapidOcrNet (PP-OCRv5 latin) over Microsoft.ML.OnnxRuntime. No LLM call, no network at run-time, no per-frame latency cost beyond decoding and ONNX inference. Lower-fidelity than a frontier vision model (no tables[] reconstruction in this release) but offline and reliable.

Both providers return the same JSON shape; OcrFrameResult.Provider records which one produced each result. The pipeline writes the same ocr/{frameId}.json and ocr/combined.md artifacts regardless of provider.

Set the default provider once:

zakira-replay config set ocr.provider local

Or override per run:

zakira-replay analyze "<url>" --frames 7 --frame-strategy scene --ocr --ocr-provider local --cache

Install the local models (~30 MB, four files: detection ONNX, classification ONNX, recognition ONNX, character dictionary):

zakira-replay deps install ocr
zakira-replay deps path     # prints the resolved OCR model paths

Models are stored under <portable-dir>/models/rapidocr-ppocrv5-latin/ by default. Override with ocr.local.modelDirectory in config or ZAKIRA_REPLAY_OCR_MODEL_DIRECTORY. Individual file paths can be overridden with ocr.local.detectionModelPath, ocr.local.classificationModelPath, ocr.local.recognitionModelPath, and ocr.local.dictionaryPath (or the corresponding ZAKIRA_REPLAY_OCR_* env vars).

Warning codes emitted by the local provider:

  • OCR_LOCAL_MODELS_MISSING — one or more of the four model files were not found at resolution time. Run deps install ocr.
  • OCR_LOCAL_INIT_FAILED — ONNX session construction or RapidOCR initialisation failed.
  • OCR_LOCAL_INFERENCE_FAILED — a single frame failed to OCR; the run continues with the remaining frames.
  • OCR_UNKNOWN_PROVIDER — the requested provider name normalised to a value that is neither copilot nor local.

--ocr-instruction is ignored by the local provider (the engine extracts every visible character regardless), but the instruction is still persisted to evidence.json and manifest.json for audit.

Smart Crop (Teams/Zoom/WebEx)

Meeting-platform recordings (Teams, Zoom, WebEx, etc.) wrap slide content with UI chrome: a controls bar at the top, a participant gallery on the right, black letterbox bars, and a slide-navigation strip at the bottom. That chrome wastes 30-50% of every frame, dilutes the perceptual-hash signal used for slide grouping, and pollutes OCR output with meeting-app vocabulary. Enable smart-crop to strip it before downstream stages run:

zakira-replay analyze "C:\meetings\team-sync.mp4" --frames 12 --frame-strategy scene --ocr --vision --smart-crop

Or set the default once:

zakira-replay config set crop.enabled true
zakira-replay config set crop.profile auto

The reference algorithm (ported from the conference-book-of-news SKILL) runs four passes on each frame in order:

  1. Top/bottom letterbox: trim solid black bars from the top and bottom.
  2. Controls bar: find a fully-bright row in the first 80 px of remaining content (the meeting-app control strip) and trim past it.
  3. Participant gallery sidebar: scan from 90 % → 60 % of the width for a thin bright strip with darker content to its left. Crop at the strip.
  4. Bottom navigation: unconditional 25 px trim.

The cropped frame is written to frames/<frameId>-cropped.jpg. The FrameArtifact records the source dimensions, the resulting Width/Height, the Crop rectangle, and OriginalPath pointing back at the source. Downstream stages (perceptual hash, slide grouping, OCR, vision) read Path opaquely and automatically see the cropped frame.

Profiles (--smart-crop-profile):

  • auto (default), generic, teams, zoom, webex — share the same algorithm in this release; the value is recorded on each FrameCropBox.Source (smart-crop-teams, smart-crop-auto, etc.) for audit and so future platform-specific tunings can branch on it.
  • off — disable smart-crop regardless of --smart-crop or crop.enabled.

Safety: if the candidate crop would remove more than 50 % of the width or leave less than 30 % of the height, the original frame is retained and a CROP_BAIL_OUT (severity info) is emitted. This prevents the algorithm from over-cropping non-meeting content (slide-only recordings, screen captures without UI chrome, etc.).
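The safety rule reduces to two ratio checks, sketched here in Python. The 50 % and 30 % thresholds and the CROP_BAIL_OUT code come from the text above; the function name and return shape are illustrative.

```python
from typing import Optional, Tuple


def crop_or_bail(src_w: int, src_h: int,
                 crop_w: int, crop_h: int) -> Tuple[Tuple[int, int], Optional[str]]:
    """Return the accepted (width, height) and a warning code if the crop was rejected."""
    removed_width = (src_w - crop_w) / src_w
    kept_height = crop_h / src_h
    if removed_width > 0.5 or kept_height < 0.3:
        return (src_w, src_h), "CROP_BAIL_OUT"  # keep the original frame
    return (crop_w, crop_h), None


print(crop_or_bail(1920, 1080, 1400, 900))  # ((1400, 900), None)
print(crop_or_bail(1920, 1080, 900, 900))   # ((1920, 1080), 'CROP_BAIL_OUT')
print(crop_or_bail(1920, 1080, 1800, 300))  # ((1920, 1080), 'CROP_BAIL_OUT')
```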

Warning codes emitted by smart-crop:

  • CROP_IMAGE_DECODE_FAILED — could not decode a frame (e.g. missing file).
  • CROP_BAIL_OUT — safety threshold tripped; original frame retained.
  • CROP_PROFILE_UNKNOWN — the requested profile name is not recognised; falls back to auto.
  • CROP_OUTPUT_FAILED — failed to write the cropped JPG to disk.

Frame Capture Modes

--capture-mode (or capture.mode in config) selects how frames are pulled out of the source:

  • ytdlp (default) — resolve a direct media URL with yt-dlp and extract frames with ffmpeg. Works for the ~1000 sites yt-dlp supports plus local media files; cheap, fast, no browser required.
  • browser — drive a Playwright-controlled Chromium pinned to the user's Edge install (edge.path) to navigate the page, click play, poll video.duration, seek with video.currentTime, and screenshot the <video> element at evenly-spaced timestamps. Use for sites yt-dlp can't reach: custom enterprise portals, Medius/Teams recordings, dynamic players whose URL only serves a fully-rendered SPA.
  • auto — try yt-dlp first; if it can't resolve a direct media URL, fall back to browser and emit a CAPTURE_BROWSER_FALLBACK info-level warning so orchestrators can audit which path was used.

# Force browser capture for an authenticated SharePoint portal
zakira-replay analyze "https://corp.sharepoint.com/sites/.../watch/abc" --capture-mode browser --frames 7 --ocr --vision

# Let Zakira.Replay decide; safe to use as a default
zakira-replay analyze "https://example.com/some-video" --capture-mode auto --frames 7 --cache

Browser-mode tunables (config keys; CLI access is limited to --capture-mode for now):

  • capture.browser.playButtonSelector — CSS or Playwright locator for the play button. When null, the client tries video.play() on the element matching videoElementSelector, then falls back to the first button[aria-label*='play' i].
  • capture.browser.videoElementSelector — CSS selector for the <video> element. Defaults to video.
  • capture.browser.seekWaitSeconds — wait after video.currentTime = ... before screenshotting. The reference SKILL uses 2.5s (1.0 too fast, 2.0 mostly works, 2.5 reliable). Raise to 3.0-4.0 for HD videos or slower machines.
  • capture.browser.durationProbeTimeoutSeconds — max wait for video.duration to become a finite number (defaults to 20s).
  • capture.browser.jpegQuality — JPEG quality for screenshots written to frames/scene-NNNN.jpg (defaults to 90).
  • capture.browser.captureCaptions — when true (default), attach a network listener while the page is loaded and the video is played, capturing every .vtt / .srt response into captions/browser-NNNN.vtt and recording an inventory at captions/discovered.json. When the run had no transcript otherwise (no yt-dlp captions, no sidecar, no STT), the best-language match is used to populate transcript.md retroactively.
  • capture.browser.maxCaptionBytes — safety cap on the size of any single captured caption file (default 5 MiB). Larger responses are skipped with CAPTIONS_BROWSER_NETWORK_DOWNLOAD_FAILED.

Browser-discovered captions

When --capture-mode browser (or auto and the browser path was used), the Playwright network interceptor watches every response that comes off the wire while the page loads and the video plays. Anything whose URL ends in .vtt or .srt (case-insensitive, after stripping query strings) is captured: the body is downloaded, deduplicated by SHA-256, and written to captions/browser-NNNN.vtt.

Each capture is recorded with:

  • The original network URL with all query-string parameters intact (so SAS tokens and language selectors stay auditable).
  • An inferred BCP-47 language code, when one can be guessed from the URL. The heuristics, tried in order, are: Microsoft Medius Caption_<lang>.vtt paths, generic <sep>xx[-XX].vtt filenames (2-letter primary), /captions/<lang>/-style path segments, and ?lang= / ?hl= / ?language= / ?l= / ?tlang= query strings.
  • The heuristic that produced the language tag (url-Caption_<lang>, url-filename, url-path-segment, url-query-lang, …), so false positives are easy to triage.
  • Byte count, content-type, SHA-256 hash for cross-run dedupe.
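The URL matching and language heuristics can be approximated in Python as below. This is a sketch under assumptions, not the tool's code: the regular expressions are guesses at each heuristic, the accepted separators in filenames are assumed to be ".", "_" and "-", and the inferred tag is returned as found without BCP-47 normalisation.

```python
import re
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlsplit

LANG = r"[A-Za-z]{2}(?:-[A-Za-z]{2})?"  # xx or xx-XX


def is_caption_url(url: str) -> bool:
    """True when the path, ignoring the query string, ends in .vtt or .srt."""
    return urlsplit(url).path.lower().endswith((".vtt", ".srt"))


def infer_language(url: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (language, heuristic name), trying the heuristics in order."""
    parts = urlsplit(url)
    match = re.search(rf"Caption_({LANG})\.vtt$", parts.path)
    if match:
        return match.group(1), "url-Caption_<lang>"
    match = re.search(rf"[._-]({LANG})\.(?:vtt|srt)$", parts.path, re.IGNORECASE)
    if match:
        return match.group(1), "url-filename"
    match = re.search(rf"/captions/({LANG})/", parts.path, re.IGNORECASE)
    if match:
        return match.group(1), "url-path-segment"
    query = parse_qs(parts.query)
    for key in ("lang", "hl", "language", "l", "tlang"):
        if key in query:
            return query[key][0], "url-query-lang"
    return None, None


for url in ("https://media.example.com/v/Caption_en-US.vtt?sig=abc",
            "https://cdn.example.com/talk.fr.vtt",
            "https://cdn.example.com/captions/de/track.vtt",
            "https://cdn.example.com/a/track.vtt?hl=ja"):
    print(is_caption_url(url), infer_language(url))
```

Recording the heuristic name next to the tag is what makes a wrong guess, such as a filename ending in a two-letter word, straightforward to spot later.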

The full inventory is written to captions/discovered.json (schema: captions-discovered.schema.json). When the pipeline reaches the transcript step with transcript == null (no yt-dlp captions, no sidecar, STT was either not requested or also failed), the best-language match is selected using the same --caption-languages resolution that yt-dlp uses (so info.Language from the source's metadata is the "main"/"original" hint), parsed via the same SubtitleConverter, and persisted to transcript.md. The TranscriptArtifact.Kind for these is browser-network.

If no captions were observed during browser playback, a CAPTIONS_BROWSER_NETWORK_NONE warning (severity info) is emitted so orchestrators can branch.

Warning codes specific to browser caption capture:

  • CAPTIONS_BROWSER_NETWORK_NONE — browser capture ran but no caption response was observed.
  • CAPTIONS_BROWSER_NETWORK_DOWNLOAD_FAILED — a single caption response failed to download (timeout, oversize body, transient Playwright error). Other captures continue.
  • CAPTIONS_BROWSER_NETWORK_PARSE_FAILED — a captured caption file could not be parsed as VTT/SRT or parsed to zero segments. Pipeline continues with no transcript fill.

Browser-captured media for STT fallback

When --stt is requested AND no inline captions were intercepted, browser capture additionally observes media-shaped responses (video/*, audio/*, HLS / DASH manifests) during playback. After the existing capture finishes, it picks the largest candidate URL and re-downloads it via the authenticated Playwright context (so SharePoint Stream's SAS-token cookies travel with the request). The downloaded file lands at media/browser-fetched.<ext> in the run dir; ffmpeg then extracts an audio track and Whisper STT runs as if the audio had come from yt-dlp.

This is a "fit-for-purpose" fallback, not a general media downloader:

  • Works for sites that hand back a single addressable media URL (typical SharePoint Stream pattern when the recording was uploaded as a single MP4).
  • Does NOT work for HLS / DASH chunked streams: audio is split across hundreds of small .m4s fragments with no single addressable URL. Manifest parsing + segment reassembly is out of scope for now.
  • Does NOT work for DRM-protected streams (rare for internal corporate recordings).

The media-collection side-channel is off by default and only activates when:

  • --stt was requested (request.UseSpeechToText == true), AND
  • The transcript step found no captions/subtitles, AND
  • No audio was otherwise resolved (no yt-dlp media URL, no sidecar)
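The three activation conditions and the "largest candidate" rule can be expressed as a small Python sketch. The names `MediaCandidate`, `should_collect_media`, and `pick_candidate` are hypothetical, and the sketch does not model how the tool tells a single-file response apart from a stream fragment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MediaCandidate:
    url: str
    content_type: str
    size_bytes: int


def should_collect_media(use_stt: bool, has_captions: bool, has_audio: bool) -> bool:
    """All three conditions must hold before media responses are observed."""
    return use_stt and not has_captions and not has_audio


def pick_candidate(candidates: List[MediaCandidate]) -> Optional[MediaCandidate]:
    """Pick the largest observed candidate, or None when nothing was seen."""
    return max(candidates, key=lambda c: c.size_bytes, default=None)
```

A `None` result corresponds to the case where no candidate URL was observed and STT is skipped.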

When the fallback runs but no candidate URL is observed, a CAPTURE_BROWSER_MEDIA_NO_CANDIDATE (info) tells the orchestrator STT was skipped because the player streamed in fragments. When a candidate is found but the authenticated re-download fails (HTTP error, oversize, timeout), CAPTURE_BROWSER_MEDIA_DOWNLOAD_FAILED (warning) fires.

Warning codes specific to browser media capture:

  • CAPTURE_BROWSER_MEDIA_DOWNLOADED (info) — media file downloaded successfully; STT will run against it.
  • CAPTURE_BROWSER_MEDIA_NO_CANDIDATE (info) — no single-file media URL was observed (chunked stream); STT will be skipped.
  • CAPTURE_BROWSER_MEDIA_DOWNLOAD_FAILED (warning) — authenticated re-download failed; STT will be skipped.

Diagnostic capture (--capture-debug)

For reverse-engineering a vendor-specific player, pass --capture-debug to analyze (or zakira-replay config set capture.browser.debug true for a persistent default). During the existing browser-capture session, this writes a side-channel diagnostic dump under runs/<run-id>/debug/:

runs/<run-id>/debug/
├── network.log                          # JSONL: one row per response
├── network.har                          # Playwright-recorded HAR (load into DevTools)
├── texttracks-state.json                # snapshot of <video>.textTracks post-activation
└── metadata-responses/
    ├── 0042-3a9b2f17.json               # full body of every JSON/XML/text response
    ├── …                                #  under `capture.browser.debugMaxBodyBytes` (default 1 MB)
    └── index.json                       # URL → body file map with SHA-256s

The recorder doesn't affect capture behaviour: it is strictly a side channel, and any individual body that fails to fetch is dropped silently. Binary bodies (video/audio/octet-stream) are logged but not persisted to disk, which keeps the dump compact. The per-body cap is configurable:

zakira-replay config set capture.browser.debug true
zakira-replay config set capture.browser.debugMaxBodyBytes 5242880   # 5 MB per body

Useful when adding support for a new player: you can capture once, then offline-inspect what URLs the player fetches, where caption data lives in the metadata responses, and what shape it takes (inline cues, external .vtt URLs at a non-standard path, TTML, JSON, etc.).

SharePoint Stream / Microsoft Stream transcripts

SharePoint Stream's player (StreamWebApp / OnePlayer) is more involved than a generic HTML5 video player and warrants a dedicated note. Three things are non-standard:

  1. Captions aren't stored in textTracks. Setting track.mode = "showing" does nothing useful because the entries on <video>.textTracks are UI stubs whose cues arrays never populate — the actual captions live in Stream's React/SPA state.
  2. Captions aren't fetched as .vtt/.srt URLs. The standard network interceptor sees nothing matching that pattern even when transcripts exist.
  3. Media is served as DASH-style fragmented MP4 with AES-128-CBC encryption. Direct download of the audio is non-trivial; the existing CAPTURE_BROWSER_MEDIA_NO_CANDIDATE warning fires because no single-file media URL ever appears on the wire.

Zakira works around all three by recognising the Stream player's transcripts-metadata API call:

GET /personal/{upn}/_api/v2.X/drives/{drive-id}/items/{item-id}?select=media/transcripts,audioTracks&$expand=media/transcripts,media/audioTracks

The JSON response lists every transcript attached to the recording with a temporaryDownloadUrl per transcript. Zakira follows each URL via the authenticated Playwright context (Edge profile cookies), and tries multiple URL variants in priority order to coax out the richest format:

  1. ?isformatjson=true&transcriptkey=<id> — the exact query the Stream player itself uses. Returns the full Microsoft Teams transcript JSON ($schema:transcript.json) with speakerDisplayName, speakerId, confidence, roomId, and ISO 8601 startOffset/endOffset per entry. This is the one with speakers.
  2. ?$format=json — OData content-negotiation hint.
  3. ?format=json — non-OData fallback.
  4. Plain URL — last resort, returns a stripped public WebVTT (no speakers).
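The priority order above amounts to a "try each variant until one returns a body" loop. The Python sketch below shows that shape under stated assumptions: `transcript_url_variants` and `first_success` are hypothetical names, `fetch` stands in for the authenticated Playwright request, and query parameters are appended as plain strings so that any existing signed parameters are left untouched.

```python
from typing import Callable, List, Optional, Tuple


def with_params(url: str, params: str) -> str:
    """Append a raw query fragment without re-encoding the existing URL."""
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}{params}"


def transcript_url_variants(download_url: str, transcript_key: str) -> List[str]:
    """Variants in the priority order listed above, richest format first."""
    return [
        with_params(download_url, f"isformatjson=true&transcriptkey={transcript_key}"),
        with_params(download_url, "$format=json"),
        with_params(download_url, "format=json"),
        download_url,  # last resort: plain WebVTT without speakers
    ]


def first_success(
    urls: List[str], fetch: Callable[[str], Optional[bytes]]
) -> Optional[Tuple[str, bytes]]:
    """Return the first (url, body) pair for which fetch yields a body."""
    for url in urls:
        body = fetch(url)
        if body is not None:
            return url, body
    return None
```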

When the rich JSON is obtained, Zakira converts to standard WebVTT with proper <v Speaker> voice spans, preserving speaker attribution through to SubtitleConverter. Output:

[00:00:06.372 - 00:00:09.572] [Liad Shiran] Hello, good morning, everyone.
[00:00:11.112 - 00:00:17.912] [Boris Forzun] Let's get started.
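For readers unfamiliar with WebVTT voice spans, here is a minimal Python sketch of the conversion step. It assumes the transcript entries have already been reduced to hypothetical `start_ms`, `end_ms`, `speaker`, and `text` fields; parsing the ISO 8601 offsets from the Teams JSON is not shown, and the field names are mine rather than the tool's.

```python
from typing import Iterable, Mapping


def vtt_timestamp(ms: int) -> str:
    """Format milliseconds as a WebVTT timestamp, hh:mm:ss.ttt."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def escape_cue_text(text: str) -> str:
    """Escape the characters that WebVTT cue text treats as markup."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def to_webvtt(entries: Iterable[Mapping]) -> str:
    """Emit one cue per entry, with the speaker carried in a <v> voice span."""
    lines = ["WEBVTT", ""]
    for entry in entries:
        start = vtt_timestamp(entry["start_ms"])
        end = vtt_timestamp(entry["end_ms"])
        lines.append(f"{start} --> {end}")
        lines.append(f"<v {entry['speaker']}>{escape_cue_text(entry['text'])}")
        lines.append("")
    return "\n".join(lines)
```

The closing `</v>` tag is omitted because WebVTT allows that when the voice span covers the whole cue.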

If the player does not make the transcripts-metadata API call itself during automation (this has been observed to vary by recording), Zakira proactively queries it using the (drive-id, item-id) harvested from any other SharePoint REST call observed on the same item (labelPolicies, analytics/allTime, etc.), so Stream support works regardless of player behaviour.

This activates automatically — no flag needed. As long as you've initialised an Edge profile via auth init-edge-profile and signed into SharePoint, browser-capture against any *.sharepoint.com/.../stream.aspx?id=... URL produces a real, speaker-attributed transcript when one exists.

Both auto-generated Teams captions and manually uploaded transcripts work. If multiple transcripts are attached (e.g., English + machine-translated French), all are downloaded; the existing transcript-fill logic picks the best-language match for transcript.md.

Warning codes specific to Stream:

  • CAPTURE_STREAM_TRANSCRIPT_DISCOVERED (info) — metadata response observed, N transcripts listed.
  • CAPTURE_STREAM_TRANSCRIPT_DOWNLOADED (info) — per-transcript download succeeded.
  • CAPTURE_STREAM_METADATA_PARSE_FAILED (warning) — response body wasn't recognisable JSON / media.transcripts[] shape.
  • CAPTURE_STREAM_TRANSCRIPT_PARSE_FAILED (warning) — transcript body downloaded but didn't convert to WebVTT (unknown shape); raw body kept under captions/.

If a recording has no transcript at all (auto-captioning was disabled for the meeting, or it's not a Teams recording), STT fallback would be the next step. However, Stream's audio is encrypted (DASH urn:mpeg:dash:sea:aes128-cbc:2013), so an audio-only download requires decryption that Zakira doesn't currently ship. The CAPTURE_BROWSER_MEDIA_NO_CANDIDATE warning makes the gap clear, and the diagnostic dump captures the DASH manifest if you want to investigate further.

Reusing a Dedicated Edge Profile (Persistent Context)

Auth profiles store cookies as a plaintext Playwright StorageState JSON file. That file is portable — copy it to another machine and it works there too. Convenient, but every leaked StorageState is a complete drop-in session credential.

The alternative is to point Zakira.Replay at a dedicated Microsoft Edge user-data-dir. Edge stores cookies in its native SQLite, with the sensitive columns DPAPI-encrypted per-user, per-machine on Windows (Keychain on macOS, libsecret/KWallet on Linux). A leaked Edge profile is unreadable on a different machine. Cookies refresh in place during normal use, so the 1-hour StorageState refresh cycle goes away. This is the recommended approach for SharePoint Stream / Microsoft Stream / authenticated Microsoft 365 portals.

One-command setup (per machine)

# Launch Edge against the dedicated user-data-dir, sign in interactively, close Edge.
# Zakira verifies the profile is initialised and reports the cookie path.
zakira-replay auth init-edge-profile --url https://microsofteur-my.sharepoint.com/

# Confirm Zakira sees a ready profile:
zakira-replay doctor       # → edge-profile: ready (...edge-profile, Default, ...)

Once the profile is initialised, your browser-capture analyses automatically use persistent-context mode whenever the configured profile directory contains a Cookies file. No CLI flag is needed:

zakira-replay analyze "https://microsofteur-my.sharepoint.com/.../stream.aspx?id=..." \
    --capture-mode browser --smart-crop --smart-crop-profile teams \
    --stt --llm-provider local-whisper --ocr --vision --vision-provider local --cache

On-disk layout

%LOCALAPPDATA%\Zakira.Replay\edge-profile\          ← user-data-dir (the directory)
├── Local State                              ← user-data-dir metadata
├── First Run
└── Default\                                 ← profile sub-folder (the name)
    ├── Network\Cookies                       ← DPAPI-encrypted SQLite
    ├── Login Data                            ← DPAPI-encrypted (if you saved passwords)
    └── …
  • capture.browser.edgeUserDataDir — absolute path to the user-data-dir. Stored verbatim (env-var literals like %LOCALAPPDATA% are preserved) so the config travels between machines; expansion happens at read time. Default: %LOCALAPPDATA%\Zakira.Replay\edge-profile.
  • capture.browser.edgeProfileDirectory — sub-folder name. Default "Default"; only change this if you've manually created multiple profiles inside the same user-data-dir.

Cross-machine workflow

Zakira config syncs across machines fine (the env-var literal in edgeUserDataDir expands per-machine). The Edge profile contents do not sync — DPAPI keys are per-user, per-machine, so even if you copied the directory it wouldn't be usable elsewhere. This is the property that makes the dedicated-profile approach more secure than StorageState.

On a new machine: zakira-replay auth init-edge-profile --url <site> once, then zakira-replay analyze works.

If you forget the init step, the analyze run prints:

[CAPTURE_BROWSER_PROFILE_NOT_INITIALIZED] (info) Edge profile at <path> is not initialized.
  Run `zakira-replay auth init-edge-profile` to sign in once per machine.
  Continuing with the StorageState path for now.

If you then hit an auth-gated URL:

[CAPTURE_BROWSER_AUTH_REQUIRED] (error) Page redirected to a sign-in URL (login.microsoftonline.com/...).
  Run `zakira-replay auth init-edge-profile --url <site>` to re-sign in and retry.

No more silent CAPTURE_DURATION_UNRESOLVED timeouts when the only real problem was an expired session.

Failure-mode warning codes

  • CAPTURE_BROWSER_PROFILE_NOT_INITIALIZED (info) — no Cookies file in the configured profile sub-folder. Capture falls back to StorageState/anonymous; run auth init-edge-profile to enable persistent-context mode.
  • CAPTURE_BROWSER_PROFILE_DIR_MISSING (error) — explicit edgeUserDataDir points at a non-existent directory. Capture aborts.
  • CAPTURE_BROWSER_PROFILE_LOCKED (error) — SingletonLock present inside the profile sub-folder; Edge is already using the dir. Close Edge and retry.
  • CAPTURE_BROWSER_PROFILE_LAUNCH_FAILED (error) — LaunchPersistentContextAsync threw (corrupt profile, DPAPI key unavailable, incompatible Edge version). The Playwright exception message is included.
  • CAPTURE_BROWSER_AUTH_REQUIRED (error) — post-navigation URL matched a sign-in domain. Re-init the profile.
  • CAPTURE_BROWSER_AUTH_MFA_DETECTED (error) — page contains a Microsoft MFA challenge selector that headless capture cannot satisfy. Re-init interactively.
  • CAPTURE_PROFILE_CONFLICT (info) — both --auth-profile and an initialized edgeUserDataDir were supplied; persistent-context wins, the StorageState profile is ignored for this run.

Manual setup (if you'd rather not use the helper)

# Equivalent of `auth init-edge-profile`:
msedge.exe --user-data-dir="%LOCALAPPDATA%\Zakira.Replay\edge-profile" `
           --profile-directory=Default `
           --no-first-run --no-default-browser-check `
           https://microsofteur-my.sharepoint.com/
# Sign in, complete MFA, close Edge. Zakira will see the cookies on the next `doctor` / `analyze`.

Security comparison vs. StorageState

| Threat | StorageState JSON (auth login) | Dedicated Edge profile (auth init-edge-profile) |
|---|---|---|
| Stolen laptop without disk encryption | Cookies fully readable as plaintext JSON, usable on any machine until expiry | Cookies unreadable (DPAPI keyed to your user+machine) |
| Accidentally committed to git / pushed to a remote | Fully usable from anywhere | Unusable on attacker's machine |
| OneDrive Known-Folder-Backup of the user profile | Plaintext JSON syncs to cloud | %LOCALAPPDATA% is excluded by default; encrypted contents wouldn't be useful anyway |
| Malware running as your Windows user | Direct read | DPAPI decrypts for the running user (same risk) |
| Other user on the same machine | Compromised if dir is world-readable | Cookies unreadable (different DPAPI key) |
| Re-auth frequency | Every ~60 min (StorageState files expire fast) | Cookies refresh in-place during use; profile valid until Conditional Access forces re-auth |

This is standard Chromium/Edge encryption behaviour — the same DPAPI machinery that protects your daily Edge browsing.

Evidence Alignment

zakira-replay align <run-directory> (and the MCP build_evidence_alignment tool) emits two cross-modal views under evidence-aligned/. Both files share evidence-aligned.schema.json and are pure rearrangements of evidence.json (and chapters/chapters.json when present); no model calls are made.

  • evidence-aligned/by-chapter.json — one entry per chapter, joining slideIds, transcriptSegmentIds, ocrFrameIds, visionFrameIds, and per-speaker statistics within the chapter window.
  • evidence-aligned/by-slide.json — one entry per slide, joining frameIds, the slide's ocr and vision results, transcriptSegmentIds spoken while the slide was visible, per-speaker statistics over the slide window, and the chapters the slide overlaps.

Slide visibility windows are extended to [slide[i].firstSeenSeconds, slide[i+1].firstSeenSeconds) (with the last slide covering up to evidence.durationSeconds) so the answer to "which transcript segments were spoken while slide N was on screen" matches the obvious "slide N is shown until slide N+1 appears" assumption. Run chapters build first if you want a populated by-chapter view; without it, by-chapter.json is emitted with an empty chapters[] array.
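The window rule can be made concrete with a short Python sketch. It is illustrative only: `slide_windows` and `segments_per_slide` are hypothetical names, and treating a transcript segment as "spoken while the slide was visible" whenever the two intervals overlap is my assumption about the join.

```python
from typing import Dict, List, Tuple


def slide_windows(first_seen: List[float], duration: float) -> List[Tuple[float, float]]:
    """Half-open [start, end) window per slide; the last one runs to duration."""
    ends = first_seen[1:] + [duration]
    return list(zip(first_seen, ends))


def segments_per_slide(
    first_seen: List[float],
    duration: float,
    segments: Dict[str, Tuple[float, float]],
) -> List[List[str]]:
    """For each slide, list the ids of transcript segments overlapping its window."""
    result = []
    for start, end in slide_windows(first_seen, duration):
        result.append(
            [sid for sid, (seg_start, seg_end) in segments.items()
             if seg_start < end and seg_end > start]
        )
    return result
```

With slides first seen at 0 s, 30 s, and 75 s in a 120 s video, the windows are [0, 30), [30, 75), and [75, 120). A segment that straddles a slide change is attributed to both slides under the overlap assumption.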

For sites that require browser cookies or an authenticated session, pass through yt-dlp auth options:

zakira-replay analyze https://example.com/video --cookies C:\path\to\cookies.txt
zakira-replay analyze https://example.com/video --cookies-from-browser edge
zakira-replay analyze https://example.com/video --browser-auth chrome

LLM calls default to the GitHub Copilot SDK. The SDK uses your existing GitHub/Copilot login. The default requested model is gpt-5.5; if unavailable, Zakira.Replay asks the SDK for available models and falls back to a suitable model.

OpenAI and Azure OpenAI can be selected with --llm-provider openai, --llm-provider azure-openai, ZAKIRA_REPLAY_LLM_PROVIDER, or llm.provider in config. OpenAI uses chat completions for text/image work and /audio/transcriptions for STT. Azure OpenAI currently supports chat/image work only; audio transcription through Azure is not wired yet.

Local Whisper STT (--llm-provider local-whisper)

For fully-local speech-to-text — no API key, no network, no quota — pick local-whisper. Zakira.Replay runs Whisper.net (managed bindings to whisper.cpp) entirely on the caller's machine and emits the same Markdown timestamps as the cloud STT paths, so chunked stitching, normalisation, evidence alignment, and search work without any other changes.

local-whisper is STT-only: it has no chat/vision/OCR surface. Compose it with --ocr-provider local for a fully-offline run, or combine with a cloud chat provider when you still need vision/OCR. Selecting local-whisper for llm ask is rejected with a clear error.

Setup (one-time, opt-in):

# Default `small` model (~466 MB, recommended balance of accuracy and speed)
zakira-replay deps install whisper-model

# Or pick a specific size
zakira-replay deps install whisper-model --whisper-model base
zakira-replay deps install whisper-model --whisper-model large-v3-turbo

Sizes available (matches the whisper.cpp Hugging Face repository): tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v1, large-v2, large-v3, large-v3-turbo. Set HF_TOKEN in your environment to lift Hugging Face rate limits on large downloads.

Run STT locally:

zakira-replay analyze https://example.com/video --stt --llm-provider local-whisper --ocr-provider local

Configuration keys (all optional — defaults work):

| Key | Default | Purpose |
|---|---|---|
| llm.localWhisper.modelPath | derived from modelSize | Explicit ggml model path; overrides everything else |
| llm.localWhisper.modelSize | small | Size used to derive modelPath against the portable Whisper directory |
| llm.localWhisper.language | auto | Whisper language hint; auto enables built-in language detection |
| llm.localWhisper.threads | null (auto) | Native thread count |
| llm.localWhisper.autoDownload | true | First-run convenience; set false to require explicit deps install whisper-model … |

Environment variables (override config): ZAKIRA_REPLAY_WHISPER_MODEL_PATH, ZAKIRA_REPLAY_WHISPER_MODEL_DIRECTORY, ZAKIRA_REPLAY_WHISPER_MODEL_SIZE, ZAKIRA_REPLAY_WHISPER_LANGUAGE, ZAKIRA_REPLAY_WHISPER_THREADS, ZAKIRA_REPLAY_WHISPER_AUTODOWNLOAD.

zakira-replay doctor reports the resolved model path under the synthetic whisper-model dependency.

Native runtimes: out of the box, Whisper.net.Runtime (CPU) ships with the dotnet tool. For GPU acceleration (CUDA/Vulkan/CoreML/OpenVINO), follow Whisper.net's pluggable runtime docs — the loader will pick up alternative native binaries placed under the conventional runtimes/<rid>/native/ layout.

Warning codes specific to local STT: STT_LOCAL_MODEL_MISSING, STT_LOCAL_INIT_FAILED, STT_LOCAL_INFERENCE_FAILED. Per-chunk failures still surface as STT_CHUNK_FAILED, the same way the cloud STT paths do.

Local LLM via Ollama (--llm-provider ollama)

For fully-local chat and vision — no API key, no network egress — pick ollama. Zakira.Replay talks to a running Ollama daemon through OllamaSharp, which implements Microsoft.Extensions.AI.IChatClient natively. That makes Ollama the reference path for the IChatClient abstraction the codebase now exposes internally; the rest of the providers will migrate onto the same surface in subsequent releases.

ollama is chat / vision only: it does not serve audio models. Audio attachments fail fast with a pointer to local-whisper. Combine --llm-provider ollama with --ocr-provider local (default) and --llm-provider local-whisper (configured separately for STT) for an end-to-end air-gapped pipeline.

Setup (one-time, opt-in — Ollama itself is not bundled):

# 1. Install Ollama (https://ollama.com/download) — the daemon runs locally on port 11434.
# 2. Pull a model:
ollama pull qwen2.5:7b              # general chat (default)
ollama pull llama3.2-vision:11b     # vision-capable for --ocr-provider copilot / --vision

# 3. Point Zakira.Replay at it (defaults are usually fine):
zakira-replay config set llm.ollama.endpoint http://localhost:11434
zakira-replay config set llm.ollama.model qwen2.5:7b
zakira-replay config set llm.ollama.visionModel llama3.2-vision:11b

Run analysis through Ollama:

zakira-replay analyze https://example.com/talk --ocr --vision --llm-provider ollama --ocr-provider copilot

Configuration keys:

| Key | Default | Purpose |
|---|---|---|
| llm.ollama.endpoint | http://localhost:11434 | HTTP endpoint of the Ollama daemon |
| llm.ollama.model | qwen2.5:7b | Chat model (matches ollama pull names) |
| llm.ollama.visionModel | null | Vision-capable model used when image attachments are present; falls back to model when null |
| llm.ollama.timeoutSeconds | 300 | Per-request timeout (local inference can be slow on CPU-only machines) |
| llm.ollama.endpointEnvVars | [ZAKIRA_REPLAY_OLLAMA_ENDPOINT, OLLAMA_HOST] | Env-var names checked for the endpoint override |
| llm.ollama.modelEnvVars | [ZAKIRA_REPLAY_OLLAMA_MODEL] | Env-var names checked for the chat-model override |
| llm.ollama.visionModelEnvVars | [ZAKIRA_REPLAY_OLLAMA_VISION_MODEL] | Env-var names checked for the vision-model override |

Environment variables: ZAKIRA_REPLAY_OLLAMA_ENDPOINT, ZAKIRA_REPLAY_OLLAMA_MODEL, ZAKIRA_REPLAY_OLLAMA_VISION_MODEL. OLLAMA_HOST (Ollama's own standard env var) is honoured as a secondary fallback for the endpoint.

zakira-replay doctor probes the daemon with a 2-second /api/tags request and reports the result under the synthetic ollama dependency.

IChatClient (the LLM provider abstraction)

OpenAI and Azure OpenAI providers now use Microsoft.Extensions.AI.IChatClient internally — the public ILlmProvider surface is unchanged but the underlying transport goes through the official OpenAI / Azure.AI.OpenAI SDKs and Microsoft.Extensions.AI.OpenAI. Ollama already implements IChatClient natively. Every ILlmProvider also exposes an IChatClient view via the AsChatClient() extension method (Ollama returns the native client; OpenAI/Azure/Copilot go through their own implementations). This is the migration seam for future providers (Anthropic, Gemini, vLLM, llama.cpp servers) — they plug into IChatClient and Zakira.Replay consumes them through the same surface.

OpenAI-compatible endpoints (OpenRouter, Together, Groq, vLLM, llama.cpp server, …)

Any OpenAI-compatible endpoint can be used through --llm-provider openai by overriding the base URL. No dedicated provider needed:

# OpenRouter (Claude, Gemini, Mistral, Llama, Qwen, DeepSeek, 200+ models)
export OPENAI_API_KEY=sk-or-v1-...
export OPENAI_BASE_URL=https://openrouter.ai/api/v1
zakira-replay analyze https://example.com/talk --llm-provider openai --model anthropic/claude-sonnet-4

# Groq
export OPENAI_API_KEY=gsk_...
export OPENAI_BASE_URL=https://api.groq.com/openai/v1
zakira-replay analyze https://example.com/talk --llm-provider openai --model llama-3.3-70b-versatile

# Self-hosted vLLM / llama.cpp server
export OPENAI_API_KEY=anything
export OPENAI_BASE_URL=http://localhost:8000/v1
zakira-replay analyze https://example.com/talk --llm-provider openai --model your-model-id

The openai provider routes everything through Microsoft.Extensions.AI.OpenAI's IChatClient; the official OpenAI SDK respects OpenAIClientOptions.Endpoint exactly like it would for api.openai.com.

Local OCR Language Packs

The local OCR provider (--ocr-provider local, the default) ships with the latin pack out of the box. RapidOCR PP-OCRv5 also publishes recognition models + dictionaries for nine other scripts; switch packs to extract non-Latin text from frames:

# Install a different pack (detection + classification are shared across packs — they download once):
zakira-replay deps install ocr --language chinese
zakira-replay deps install ocr --language korean
zakira-replay deps install ocr --language arabic

# Select which pack analysis runs use:
zakira-replay config set ocr.local.languagePack chinese
# Or per-process: ZAKIRA_REPLAY_OCR_LANGUAGE_PACK=chinese

Supported packs (all PP-OCRv5):

| Pack | Aliases | Covers |
|---|---|---|
| latin (default) | en, european, western, eu | Latin script with diacritics: French, German, Spanish, Italian, Portuguese, Polish, Vietnamese, Indonesian, etc. |
| chinese | zh, cn, ch, simplified-chinese, simp, han | Simplified Chinese (Han) |
| english | en-only | English with denser dictionary than Latin |
| korean | ko, kr, hangul | Korean (Hangul) |
| cyrillic | ru, russian, ukrainian, uk, be, bg, sr | Cyrillic script (Russian, Ukrainian, Belarusian, Bulgarian, Serbian) |
| arabic | ar, fa, ur, persian, farsi, urdu | Arabic script (Arabic, Persian, Urdu) |
| devanagari | hi, hindi, mr, marathi, ne, nepali, sa, sanskrit | Devanagari script |
| greek | el, gr, ell | Greek |
| telugu | te, telegu | Telugu (South Indian) |
| tamil | ta, tamizh, tha | Tamil (South Indian) |

Multiple packs can live side-by-side under the OCR model directory (portable/models/rapidocr-ppocrv5-latin/ by default); switching packs is a config change, not a re-download. The detection model and classification model are shared across all packs — installing a second pack only downloads its recognition .onnx (~12 MB) and dictionary .txt.

Notes:

  • Japanese, Thai, Traditional Chinese, Georgian, Kannada are not yet in the PP-OCRv5 release. They exist for PP-OCRv4; if you need them, set the appropriate ocr.local.recognitionModelPath and ocr.local.dictionaryPath directly and point at a downloaded v4 model — Zakira.Replay will use whatever files you point it at, regardless of pack.
  • Each pack ships ~12 MB.
  • zakira-replay doctor reports the configured pack under the ocr-models row.

Local Speaker Diarization (--diarize)

--diarize runs local sherpa-onnx speaker diarization over the audio that captions / STT already produced. The pipeline uses pyannote-segmentation-3.0 for speech / speaker-change detection and a 3D-Speaker (ERes2NetV2) embedding extractor + agglomerative clustering to label each transcript segment with a SPEAKER_NN cluster. Everything runs on the local machine via ONNX Runtime — no network at run-time after the models are installed.

--diarize requires a transcript: it labels existing transcript segments, it does not transcribe. Combine with --stt (or rely on captions) to get speech first, then diarize on top.

Setup (one-time, opt-in — ~32 MB of models):

zakira-replay deps install diarization

Use:

# Auto-detect the number of speakers (threshold-based clustering, default 0.5):
zakira-replay analyze https://example.com/talk --stt --diarize

# When you know how many speakers are present, pass --num-speakers to skip the threshold:
zakira-replay analyze meeting.mp4 --stt --diarize --num-speakers 4

# Tune the clustering cutoff (lower = more speakers):
zakira-replay analyze podcast.mp4 --stt --diarize --diarize-threshold 0.35

After diarization, transcript.md is rewritten in place with [SPEAKER_NN] prefixes between the timestamp and the text. TranscriptParser picks the labels back up automatically on the next normalise pass, so the speakers[] registry in evidence.json plus the per-slide and per-chapter speaker rollups in evidence-aligned/by-slide.json and by-chapter.json are all populated without any schema changes.
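To picture how a cluster label ends up between the timestamp and the text, consider the Python sketch below. It assigns each transcript segment the diarization turn it overlaps most and then renders the line. The max-overlap rule, the zero-based two-digit numbering, and the names `speaker_for` and `label_line` are all assumptions made for illustration, not a description of the sherpa-onnx integration.

```python
from typing import List, Optional, Tuple

# Hypothetical diarizer output: (start seconds, end seconds, cluster index).
Turn = Tuple[float, float, int]


def speaker_for(start: float, end: float, turns: List[Turn]) -> Optional[str]:
    """Label of the turn with the largest time overlap, or None if none overlap."""
    best, best_overlap = None, 0.0
    for turn_start, turn_end, cluster in turns:
        overlap = min(end, turn_end) - max(start, turn_start)
        if overlap > best_overlap:
            best, best_overlap = cluster, overlap
    return None if best is None else f"SPEAKER_{best:02d}"


def label_line(timestamp: str, text: str, speaker: Optional[str]) -> str:
    """Render a transcript line, inserting [SPEAKER_NN] when a label exists."""
    if speaker is None:
        return f"[{timestamp}] {text}"
    return f"[{timestamp}] [{speaker}] {text}"
```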

Configuration keys:

| Key | Default | Purpose |
|---|---|---|
| diarization.provider | sherpa-onnx | Provider identifier; reserved for future plug-ins (pyannoteAI cloud, NeMo, etc.) |
| diarization.modelDirectory | portable models/diarization/ | Where deps install places models and where the resolver looks for them |
| diarization.segmentationModelPath | derived | Explicit ONNX path; overrides the model directory |
| diarization.embeddingModelPath | derived | Explicit ONNX path; overrides the model directory |
| diarization.numSpeakers | null | Hard cluster count; when null, falls back to threshold |
| diarization.threshold | 0.5 | Agglomerative-clustering cosine cutoff; lower → more speakers |
| diarization.minDurationOnSeconds | 0.3 | Minimum speech segment duration emitted by pyannote-segmentation |
| diarization.minDurationOffSeconds | 0.5 | Minimum silence gap between speech segments |
| diarization.threads | 1 | Native thread count for sherpa-onnx inference |
| diarization.autoDownload | true | First-run convenience; set false to require explicit deps install diarization |

Environment variables: ZAKIRA_REPLAY_DIARIZATION_PROVIDER, ZAKIRA_REPLAY_DIARIZATION_MODEL_DIRECTORY, ZAKIRA_REPLAY_DIARIZATION_SEGMENTATION_MODEL_PATH, ZAKIRA_REPLAY_DIARIZATION_EMBEDDING_MODEL_PATH, ZAKIRA_REPLAY_DIARIZATION_NUM_SPEAKERS, ZAKIRA_REPLAY_DIARIZATION_THRESHOLD, ZAKIRA_REPLAY_DIARIZATION_MIN_DURATION_ON, ZAKIRA_REPLAY_DIARIZATION_MIN_DURATION_OFF, ZAKIRA_REPLAY_DIARIZATION_THREADS, ZAKIRA_REPLAY_DIARIZATION_AUTODOWNLOAD.

zakira-replay doctor reports the resolved diarization model paths and clustering configuration under the synthetic diarization-models dependency.

Warning codes specific to diarization: DIARIZATION_NO_AUDIO (no audio extracted — diarization needs the WAV the STT step uses), DIARIZATION_NO_TRANSCRIPT (no transcript to label), DIARIZATION_MODELS_MISSING (run deps install diarization), DIARIZATION_INIT_FAILED (native sherpa-onnx initialisation failed), DIARIZATION_FAILED (inference failed mid-run), DIARIZATION_UNKNOWN_PROVIDER (only sherpa-onnx is wired in this release). VTT <v Speaker> tags and SRT Speaker: prefixes that already labelled segments are preserved — diarization never overwrites explicit speaker attribution from captions.

Secrets themselves should stay out of JSON config. The config can store secret environment variable names instead, so agents or humans can choose which variables Zakira.Replay reads without embedding keys on disk. For example, llm.openai.apiKeyEnvVars=OPENAI_API_KEY,WORK_OPENAI_API_KEY tells Zakira.Replay to try those variable names for the OpenAI API key. Built-in defaults are still appended, so standard names keep working.
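The lookup order implied by that paragraph (configured names first, built-in defaults appended) might look like the following Python sketch. `resolve_secret` is a hypothetical helper, and "first non-empty variable wins" is my assumption about how the names are tried.

```python
import os
from typing import List, Mapping, Optional


def resolve_secret(
    configured: List[str],
    built_in: List[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the first non-empty value among configured names, then built-ins."""
    for name in [*configured, *built_in]:
        value = env.get(name)
        if value:
            return value
    return None
```

Only variable names ever reach the config file; the key itself is read from the process environment at run time.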

Batch Manifest

{
  "visionInstruction": "Focus on slide titles and chart axes.",
  "ocrInstruction": "Preserve indentation in code-like text.",
  "frames": 7,
  "useSpeechToText": true,
  "useOcr": true,
  "useVision": true,
  "items": [
    { "source": "https://example.com/video1", "runId": "video-1" },
    { "source": "C:/media/video2.mp4", "frames": 5 }
  ]
}

The batch runner calls the same single-video pipeline for each item and writes a batch result under runs/. Both instructions are optional; the pipeline's baseline already extracts everything visible from frames and every readable piece of text.
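The manifest above mixes top-level settings with per-item fields such as "frames": 5. Assuming that item fields override the top-level values (the text does not state this explicitly), the effective options per item could be derived as in this Python sketch; `effective_items` is a hypothetical helper.

```python
import json
from typing import List

MANIFEST_TEXT = """
{
  "frames": 7,
  "useOcr": true,
  "items": [
    { "source": "https://example.com/video1", "runId": "video-1" },
    { "source": "C:/media/video2.mp4", "frames": 5 }
  ]
}
"""


def effective_items(manifest: dict) -> List[dict]:
    """Merge top-level defaults into each item; item fields take precedence."""
    defaults = {key: value for key, value in manifest.items() if key != "items"}
    return [{**defaults, **item} for item in manifest.get("items", [])]


items = effective_items(json.loads(MANIFEST_TEXT))
```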

Vision and OCR Steering

OCR and vision both have comprehensive baselines. Out of the box (no instruction provided) they extract:

  • Vision: every distinct piece of visible content — title text, bullets, body text, code blocks, chart titles/axes/series, UI controls and labels, captioned text, diagram annotations.
  • OCR: every readable piece of text in the frame, preserving line breaks, with tables surfaced when actually visible.

--vision-instruction <text> and --ocr-instruction <text> (and the equivalent visionInstruction / ocrInstruction fields in MCP and batch) are optional focus signals that bias enumeration order. They never relax the "do not invent" guardrails. Good steering instructions describe what visible aspects matter, not what to conclude:

| Good (fact-shaped) | Bad (asks for synthesis) |
|---|---|
| Bias toward slide titles, code blocks, and chart axes. | Tell me which approach is better. |
| Identify on-screen UI controls and their labels. | Summarize the speaker's argument. |
| Capture visible commit messages and terminal output. | Score the slide quality. |

Both instructions are persisted verbatim into evidence.json and manifest.json (empty string when not provided) so the audit trail records exactly how the run was framed.

Queue / Worker Mode

For scalable CLI orchestration, Zakira.Replay includes a persistent local queue under runs/.queue/<queue-id>/:

zakira-replay queue enqueue https://example.com/video --queue-id research --job-id video-1 --frames 7 --cache --retries 2
zakira-replay queue status --queue-id research --json
zakira-replay queue run --queue-id research --concurrency 2 --retries 2

Queue state is stored in queue.json; the most recent worker pass writes last-run-result.json. Jobs move through pending, running, succeeded, and failed. If a worker stops while jobs are marked running, the next status/run load returns them to pending with a restart note so they can be retried.

--concurrency controls how many jobs a worker pass runs at once. --retries is the retry count beyond the first attempt, so --retries 2 allows up to 3 attempts total.
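
The retry arithmetic and the restart recovery described above can be sketched as follows. This is an illustration only: the function names and the job dictionary fields ("status", "note") are hypothetical and do not reflect the actual queue.json layout.

```python
def run_with_retries(job, retries):
    """Call job() up to retries + 1 times; return (succeeded, attempts_used)."""
    attempts = 0
    for _ in range(retries + 1):  # first attempt plus `retries` retries
        attempts += 1
        try:
            job()
            return True, attempts
        except Exception:
            continue
    return False, attempts


def recover_interrupted(jobs):
    """Return jobs left in 'running' by a stopped worker to 'pending'."""
    for job in jobs:
        if job["status"] == "running":
            job["status"] = "pending"
            job["note"] = "returned to pending after worker restart"
    return jobs
```

With retries=2, a job that fails every time is attempted three times before it is reported as failed.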

If --run-id is provided and a completed manifest.json already exists, Zakira.Replay reuses that run by default. Pass --force to recompute it.

If --cache is provided without --run-id, Zakira.Replay computes a deterministic cache key from the source and analysis options and reuses a matching prior run. Cache entries are stored under runs/.cache/.
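
One common way to derive a deterministic key from a source and a set of options is to hash a canonical JSON encoding. The sketch below shows that general technique; it is an assumption for illustration, not the key derivation Zakira.Replay actually uses, and the option names are hypothetical.

```python
import hashlib
import json


def cache_key(source, options):
    """Hash the source and options so equal inputs always give the same key."""
    # sort_keys and fixed separators make the encoding independent of dict order.
    payload = json.dumps(
        {"source": source, "options": options},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any change to an analysis option produces a different key, so a run with different settings is not mistaken for a cached one.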

Frame extraction defaults to scene sampling: ffmpeg returns frames at scene-change boundaries (filter select=gt(scene,0.35)), bounded by frames.sceneSafetyCap (default 5000), and slide grouping then deduplicates near-identical scenes. Use --frame-strategy interval to sample N evenly spaced frames, where N comes from --frames (default 500) and can optionally be scaled by --frames-per-minute (config default frames.perMinute=12). Use --frame-strategy every-frame or --every-frame for capped sequential frame extraction, where --frames/--count is the safety cap.
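
For the interval strategy, evenly spaced sampling can be pictured as picking the midpoint of N equal slices of the video. The sketch below assumes midpoint placement, which is one reasonable choice and not necessarily the one Zakira.Replay makes; the function name is hypothetical.

```python
def interval_timestamps(duration_seconds, frames):
    """Return `frames` evenly spaced timestamps (seconds) across the video."""
    if duration_seconds <= 0 or frames <= 0:
        return []
    step = duration_seconds / frames
    # Sample the middle of each slice so no frame lands exactly at 0 or at the end.
    return [round((i + 0.5) * step, 3) for i in range(frames)]


if __name__ == "__main__":
    print(interval_timestamps(100, 4))  # [12.5, 37.5, 62.5, 87.5]
```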

Clip extraction writes timestamped clips under clips/:

zakira-replay clip C:\media\demo.mp4 --start 01:20 --end 02:05 --output-name dashboard-demo

Search indexing builds an index over the transcript, OCR, vision, and warning entries in evidence.json. The default backend is a portable JSON TF-IDF index at search/index.json:

zakira-replay search build runs\example-run
zakira-replay search query runs\example-run "wireguard throughput" --top 5
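
To show what TF-IDF ranking does with a query like the one above, here is a minimal sketch over in-memory text snippets. The tokenizer, the smoothed IDF formula, and the index layout are assumptions for illustration and do not describe the format of search/index.json.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """docs maps an entry id to its text; returns term counts plus IDF weights."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n = len(docs)
    # Smoothed IDF: rarer terms get larger weights.
    idf = {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    return {"idf": idf, "docs": term_counts}


def query(index, text, top=5):
    """Rank entries by the summed TF * IDF of the query terms they contain."""
    terms = tokenize(text)
    scores = {}
    for doc_id, counts in index["docs"].items():
        score = sum(counts[t] * index["idf"].get(t, 0.0) for t in terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:top]
```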

SQLite search is also available. sqlite builds search/index.sqlite with FTS5 keyword/BM25 search. sqlite-onnx additionally stores local ONNX embedding vectors as float32 blobs and queries with hybrid FTS plus brute-force cosine scoring:

zakira-replay search build runs\example-run --backend sqlite
zakira-replay search build runs\example-run --backend sqlite-onnx --onnx-model C:\models\embedding.onnx --onnx-vocab C:\models\vocab.txt
zakira-replay search query runs\example-run "secure tunnel performance" --backend auto --top 5
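
The sqlite-onnx idea of storing float32 vectors as blobs and scoring them by brute-force cosine can be sketched as below. The blend of keyword and semantic scores (a fixed weight over a keyword score assumed to be normalized to 0..1) is an assumption for illustration; the actual hybrid formula is not documented here, and all function names are hypothetical.

```python
import math
import struct


def to_blob(vector):
    """Pack a list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_blob(blob):
    """Unpack little-endian float32 bytes back into a list of floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_rank(query_vector, rows, keyword_weight=0.5, top=5):
    """rows: iterable of (entry_id, keyword_score, embedding_blob)."""
    scored = []
    for entry_id, keyword_score, blob in rows:
        semantic = cosine(query_vector, from_blob(blob))
        combined = keyword_weight * keyword_score + (1 - keyword_weight) * semantic
        scored.append((combined, entry_id))
    scored.sort(reverse=True)  # brute force: every row is scored, then sorted
    return [entry_id for _, entry_id in scored[:top]]
```

Brute-force scoring is linear in the number of indexed entries, which is usually acceptable for a single run's evidence.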

ONNX embedding support expects a BERT/WordPiece-style vocab.txt plus an ONNX model with common text inputs such as input_ids, attention_mask, and optional token_type_ids. Zakira.Replay does not bundle a model.

Download a compatible local ONNX embedding model with either command:

zakira-replay deps install onnx
.\scripts\download-onnx-model.ps1 -Configure

Both download Xenova/all-MiniLM-L6-v2 files. The built-in installer uses the configured search.onnx.modelDirectory; the script downloads under repository-local models/, which is ignored by git.

Chapter detection builds deterministic offline lexical chapters from transcript topic shifts and duration constraints:

zakira-replay chapters build runs\example-run --min-duration 60 --max-duration 600

It writes chapters/chapters.json and chapters/chapters.md.
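
As a rough picture of how lexical topic shifts and duration limits can interact, the sketch below starts a new chapter when the vocabulary overlap with the next segment drops below a threshold (once the minimum duration is met) or when the maximum duration would be exceeded. The Jaccard overlap measure, the threshold, and the function names are assumptions; the actual chapter algorithm is not described in this document.

```python
import re


def words(text):
    # Keep words of four or more letters as a crude content-word filter.
    return set(re.findall(r"[a-z]{4,}", text.lower()))


def build_chapters(segments, min_duration=60.0, max_duration=600.0, threshold=0.1):
    """segments: list of (start_seconds, end_seconds, text), sorted by start."""
    chapters = []
    current = None
    for start, end, text in segments:
        vocab = words(text)
        if current is None:
            current = {"start": start, "end": end, "vocab": set(vocab)}
            continue
        length = current["end"] - current["start"]
        union = current["vocab"] | vocab
        overlap = len(current["vocab"] & vocab) / len(union) if union else 1.0
        shifted = overlap < threshold and length >= min_duration
        too_long = end - current["start"] > max_duration
        if shifted or too_long:
            chapters.append((current["start"], current["end"]))
            current = {"start": start, "end": end, "vocab": set(vocab)}
        else:
            current["end"] = end
            current["vocab"] |= vocab
    if current is not None:
        chapters.append((current["start"], current["end"]))
    return chapters
```

Because the procedure uses no randomness or model calls, the same transcript and the same limits always give the same boundaries.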

MCP Jobs

MCP exposes both a blocking compatibility tool and non-blocking job tools:

  • analyze_video: starts analysis and waits for completion. Use only for short videos.
  • create_analysis_job: starts analysis in the background and returns a jobId.
  • get_job_status: returns status and recent logs.
  • get_job_result: returns the completed manifest and artifact directory.
  • cancel_job: cancels a running job.
  • extract_clip: extracts a timestamped video clip.
  • build_search_index: builds a local search index over a completed run. Optional backend values are json, sqlite, and sqlite-onnx.
  • query_search_index: queries a run directory or search index. Optional backend values are auto, json, sqlite, and sqlite-onnx.
  • build_chapters: builds transcript-based chapters for a completed run and writes chapters/chapters.json plus chapters/chapters.md.

Agents should prefer the job tools for long videos or LLM-backed OCR/vision work.

MCP job snapshots are persisted under runs/.mcp/jobs/. Completed job status and results survive MCP server restarts. Jobs that were pending or running when the server stopped are restored as failed with a restart message.
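
The non-blocking flow (create a job, poll its status, fetch the result) can be sketched as follows. The tool names and jobId come from the list above; call_tool stands in for whatever MCP client the agent uses, and the argument name "source", the "status" field, and the terminal status values are hypothetical assumptions.

```python
import time

# Assumed terminal states; check the actual get_job_status payload.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def wait_for_job(call_tool, source, poll_seconds=5.0, sleep=time.sleep):
    """Start a background analysis job and block until it finishes.

    call_tool(name, arguments) must return the tool result as a dict.
    """
    job_id = call_tool("create_analysis_job", {"source": source})["jobId"]
    while True:
        status = call_tool("get_job_status", {"jobId": job_id})
        if status["status"] in TERMINAL_STATES:
            break
        sleep(poll_seconds)
    if status["status"] != "succeeded":
        raise RuntimeError(f"job {job_id} ended as {status['status']}")
    return call_tool("get_job_result", {"jobId": job_id})
```

Since a job that was pending or running at server shutdown is restored as failed, a caller built like this sees the failure on its next poll and can resubmit.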

Agent Skills

Reusable agent skill packages are included in the NuGet package:

  • skills/zakira-replay-cli/SKILL.md: CLI workflow for agents that can run shell commands.
  • skills/zakira-replay-mcp/SKILL.md: MCP workflow for agents connected to zakira-replay mcp serve.
  • skills/zakira-replay/SKILL.md: compatibility router that points agents to the focused CLI or MCP skill.
  • skills/zakira-replay/examples/mcp-client-config.json: generic MCP stdio config.
  • skills/zakira-replay/examples/job-flow.jsonl: raw JSON-RPC MCP job flow.
  • skills/zakira-replay/examples/prompts.md: prompt patterns and execution notes.
  • skills/zakira-replay/examples/artifact-checklist.md: artifact reading checklist.

Agents should load zakira-replay-cli when shell access is available, or zakira-replay-mcp when MCP tools are available. Both skills explain how to produce artifacts, inspect warnings, search evidence, build chapters, and cite timestamps without pretending to watch video directly.

Artifact Contract

Zakira.Replay does not generate books, reports, presentations, summaries, work items, or any other synthesized output. It produces fact-shaped evidence that external orchestrators consume.

Each analyzed video run writes a folder under runs/ containing:

  • request.json: original source, instruction, transcript flag, frame count, and optional run ID.
  • metadata.json: source metadata resolved from the URL or local file, including availableSubtitleLanguages when the source advertises any.
  • manifest.json: stable index of produced artifacts, structured warnings, and per-stage wall-clock timings under timings.totalSeconds + timings.stages.{probe,captions,audio,stt,diarization,frames,slides,ocr,vision,evidence,...}.
  • evidence.json: structured evidence for downstream agents/orchestrators, including per-slide grouping, per-speaker registry, and structured warnings.
  • transcript.md: normalized timestamped transcript when captions or sidecar subtitles are available; [Speaker Name] prefixes are inserted when the source carries speaker tags.
  • transcript/raw.md and transcript/raw.json: raw parsed transcript before normalization.
  • transcript/normalization.json: transcript merge audit report with merge reasons and source/result segments.
  • captions/: raw extracted subtitle files.
  • audio/: extracted audio when requested or needed for STT.
  • audio/chunks/: per-chunk WAV files and chunks.json when long audio is silence-chunked for STT.
  • frames/: representative frame images.
  • slides/slides.json: slide grouping facts (first/last visible per slide, frame IDs, primary frame).
  • ocr/{frameId}.json plus ocr/combined.md: structured OCR result per slide primary frame.
  • vision/{frameId}.json plus vision/combined.md: structured vision result per slide primary frame.
  • chapters/chapters.json and chapters/chapters.md: deterministic transcript-based chapter boundaries (when built).
  • evidence-aligned/by-chapter.json and evidence-aligned/by-slide.json: cross-modal alignment views (when built).
  • evidence.md: human-readable index of the artifact paths.

Synthesis (summaries, work items, decisions, sentiment) is the responsibility of the calling orchestrator; Zakira.Replay does not produce inferences.

JSON schemas for stable machine-readable artifacts are in schemas/:

  • schemas/request.schema.json
  • schemas/manifest.schema.json
  • schemas/evidence.schema.json
  • schemas/transcript-normalization.schema.json
  • schemas/chapters.schema.json
  • schemas/clip.schema.json
  • schemas/search-index.schema.json
  • schemas/audio-chunks.schema.json
  • schemas/slides.schema.json
  • schemas/ocr.schema.json
  • schemas/vision.schema.json
  • schemas/evidence-aligned.schema.json
  • schemas/batch.schema.json
  • schemas/batch-result.schema.json
  • schemas/queue.schema.json
  • schemas/queue-run-result.schema.json

External orchestration can use these artifacts to build conference books, summaries, search indexes, vector stores, QA systems, clip workflows, or custom reports.

Run Timings

Every run writes wall-clock timings to manifest.timings:

{
  "timings": {
    "totalSeconds": 47.832,
    "stages": {
      "probe": 0.412,
      "captions": 0.821,
      "stt": 12.155,
      "diarization": 5.673,
      "frames": 8.214,
      "slides": 0.087,
      "ocr": 6.502,
      "vision": 13.819,
      "evidence": 0.089
    }
  }
}

Stage names are open (orchestrators must tolerate new keys); the canonical set lives in RunTimingStages and is documented in CHANGELOG.md. Stages absent from the map did not run. Values are wall-clock seconds rounded to milliseconds. Use these to flag slow stages, build a "taking longer than usual" alert, or compare end-to-end runtimes across configurations (CPU vs GPU Whisper, local-whisper vs cloud STT, etc.).
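
Flagging slow stages from manifest.timings can be as simple as comparing each stage with the total. The sketch below reads only the totalSeconds and stages fields shown above; the function name and the 25% threshold are arbitrary illustrative choices.

```python
def slow_stages(manifest, share=0.25):
    """Return (stage, seconds) pairs that took at least `share` of the total run."""
    timings = manifest.get("timings", {})
    total = timings.get("totalSeconds", 0.0)
    stages = timings.get("stages", {})
    if total <= 0:
        return []
    # Unknown stage names are handled like any other key, as the contract requires.
    flagged = [(name, secs) for name, secs in stages.items() if secs / total >= share]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)
```

Applied to the example above, this flags vision (13.819 s) and stt (12.155 s), the only two stages that each account for at least a quarter of the 47.832 s total.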

Pre-flight: info --json

zakira-replay info --json

Returns a single JSON document covering the configured LLM provider, the default model, and every schema name, plus resolvedDependencies (portable directory, OCR pack, Whisper / Ollama / diarization paths) and capabilities (booleans: localOcrReady, localWhisperReady, diarizationReady, ytDlpAvailable, ffmpegAvailable). Orchestrators can call this once at startup to learn which optional features are available before issuing analysis requests.
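
An orchestrator's startup check might look like the sketch below. It assumes the capability booleans sit under a top-level "capabilities" object, as the description suggests, and that zakira-replay is on PATH; the helper names are hypothetical.

```python
import json
import subprocess


def load_info():
    """Run `zakira-replay info --json` and parse its output."""
    completed = subprocess.run(
        ["zakira-replay", "info", "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)


def missing_capabilities(info, required):
    """Return the required capability names that are absent or false."""
    capabilities = info.get("capabilities", {})
    return [name for name in required if not capabilities.get(name, False)]


if __name__ == "__main__":
    missing = missing_capabilities(load_info(), ["ffmpegAvailable", "ytDlpAvailable"])
    if missing:
        raise SystemExit(f"missing capabilities: {', '.join(missing)}")
```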

Development

Run the test suite with:

dotnet test Zakira.Replay.slnx

The ffmpeg integration test generates a tiny fixture at runtime and skips automatically when ffmpeg or ffprobe is unavailable.

License

Zakira.Replay is released under the MIT License.

Product | Compatible and additional computed target framework versions
.NET | net10.0 is compatible. net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, and net10.0-windows were computed.

This package has no dependencies.

Version Downloads Last Updated
0.10.1 99 5/19/2026
0.10.0 89 5/18/2026
0.9.1 95 5/18/2026
0.8.0 102 5/17/2026
0.7.0 93 5/17/2026
0.2.0 103 5/10/2026
0.1.0 110 5/9/2026

0.7.0: Production hardening.

  • Every analysis run now emits per-stage wall-clock timings on `manifest.timings` (probe / captions / audio / stt / diarization / frames / slides / ocr / vision / evidence) for orchestrator branching.
  • `info --json` extended with `resolvedDependencies` (portable directories, OCR pack, Whisper / Ollama / diarization paths) and `capabilities` (booleans for local-ocr, local-whisper, diarization, yt-dlp, ffmpeg readiness) so callers can pre-flight optional features without separately running `doctor`.
  • New retroactive CHANGELOG.md covers 0.2.0 -> 0.7.0.
  • CI now runs the test suite on Windows + Linux + macOS for every push and PR.
  • Additive `manifest.timings` schema; `schemaVersion` stays at 0.8; no breaking changes.

See CHANGELOG.md for the full version history.

Earlier releases: 0.6.0 baseline: IChatClient migration for OpenAI/Azure + OCR language packs. 0.5.0: sherpa-onnx diarization. 0.4.0: Ollama + IChatClient. 0.3.0: local-whisper STT.