Progress log for Sprint 1 MVP implementation. Documents setup steps, test results, and findings so everything is reproducible.
```bash
conda create -n video_chat python=3.12 -y
conda activate video_chat
pip install -r app/requirements.txt
```

Python 3.12 chosen because the model was built for `transformers==4.51.0` (see `config.json`: `"transformers_version": "4.51.0"`). Originally tried 4.55.0 from the CookBook requirements, but it had breaking API changes (`DynamicCache.seen_tokens` removed, `generate()` return type changed).
```bash
huggingface-cli download openbmb/MiniCPM-o-4_5 --local-dir models/MiniCPM-o-4_5
```

- Full BF16, ~19 GB total (4 safetensor files + model code + TTS assets)
- Downloaded to `models/` inside the project (not `~/models/`), for Docker bind-mount compatibility later
- Model includes Python files (custom architecture, `trust_remote_code=True` required)
- Security scan of `.py` files: clean, no network calls during inference, no data exfiltration
The downloaded model has a bug: `model.chat(stream=True)` crashes because `chat()` doesn't short-circuit for streaming and falls through to TTS post-processing that expects non-streaming output.
Fix: one-line patch in `models/MiniCPM-o-4_5/modeling_minicpmo.py` (after line ~1195).
See `docs/model_patches.md` for full details and how to verify/reapply.
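For orientation, the shape of the patch (illustrative only -- the variable names below are hypothetical; `docs/model_patches.md` has the real diff):

```python
# Inside chat() in modeling_minicpmo.py (after line ~1195, hypothetical names):
#
#     if stream:
#         return answer   # generator from generate(); skip TTS post-processing
#
# Without this early return, the streaming generator falls through to TTS
# post-processing that assumes a completed, non-streaming result.
```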
`HF_HOME` is set to `models/.hf_cache/` (inside the project) to avoid polluting `~/.cache/huggingface/`. This is done in `app/model_server.py` before importing `transformers`.
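A minimal sketch of the ordering that matters (the exact path construction in `app/model_server.py` may differ):

```python
import os

# Must run before the first `import transformers`, otherwise the default
# ~/.cache/huggingface location is picked up and cached for the process.
os.environ.setdefault("HF_HOME", "models/.hf_cache")

import transformers  # noqa: E402 -- intentionally after the env var is set
```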
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda, torch.cuda.get_device_name(0))"
# True 12.6 NVIDIA GeForce RTX 4090app/__init__.py-- package markerapp/config.py-- all configuration, environment variable overridableapp/model_server.py-- model loading (vision-only) +infer()functionscripts/test_model.py-- standalone test script
- Vision-only mode: `init_audio=False, init_tts=False, init_vision=True` (saves ~2-4 GB VRAM; see the loading sketch below)
- `CUDA_VISIBLE_DEVICES` set in app code (not the conda env), configurable via env var
- `HF_HOME` redirected to project-local `models/.hf_cache/`
- Streaming inference via `TextIteratorStreamer` (background thread in model code)
- `<|im_end|>` tokens filtered from output
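A sketch of the vision-only load, assuming the `init_*` keyword arguments openbmb documents for the MiniCPM-o family are accepted by the 4_5 checkpoint's custom code:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_DIR = "models/MiniCPM-o-4_5"

model = AutoModel.from_pretrained(
    MODEL_DIR,
    trust_remote_code=True,   # checkpoint ships its own architecture code
    torch_dtype=torch.bfloat16,
    init_vision=True,         # keep the vision encoder
    init_audio=False,         # skip audio encoder to save VRAM
    init_tts=False,           # skip TTS head to save VRAM
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, trust_remote_code=True)
```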
```bash
python -m scripts.test_model --image test_files/images/test.jpg
```

| Metric | Value |
|---|---|
| Model load time | 5.7s |
| Inference time (1 frame, full response) | 11.5s |
| Response length | 1759 chars |
| VRAM allocated | 16.4 GB |
| VRAM reserved | 16.7 GB |
| VRAM free | ~7.5 GB (of 24 GB) |
The model correctly described a Raspberry Pi robot car from the test image, with detailed component identification. Streaming worked -- text chunks arrived progressively.
- 11.5s was for a long, unconstrained response (~512 tokens). The monitoring loop targets shorter updates with fewer `max_new_tokens`, so cycle time should be lower.
- 7.5 GB VRAM headroom is sufficient for multi-frame context (8 frames at 64 tokens/frame = 512 tokens).
- `transformers==4.51.0` shows a deprecation warning about `seen_tokens` -- harmless, will go away when/if the model code is updated upstream.
Evaluated video capture libraries for our requirements (webcam, video files, later RTSP/v4l2loopback, 1 FPS, PIL output, background thread, Docker compatible):
| Library | Strengths | Weaknesses | Verdict |
|---|---|---|---|
| OpenCV (`cv2.VideoCapture`) | Simplest API, already installed, webcam + files + v4l2 + RTSP | No built-in threading | Chosen for PoC |
| VidGear/CamGear | Built-in threaded capture, better stream protocol support | Extra dependency, threading optimization irrelevant at 1 FPS | Consider for Sprint 2 |
| ffmpegcv | Lightweight FFmpeg wrapper, GPU decode, good stream support | Less established, extra dependency | Interesting alternative for later |
| GStreamer | Most flexible, native HW acceleration plugins | Steep learning curve, overkill for PoC | No |
Decision: OpenCV. At 1 FPS capture rate, frame capture performance is irrelevant -- model inference (11.5s) is the bottleneck, not reading a single frame. OpenCV covers all MVP sources (webcam, video files) and also supports v4l2loopback devices and RTSP for Sprint 2. Threading is a simple daemon thread wrapper. The modular design (frame_capture.py behind a clean interface) allows swapping the backend later without affecting the rest of the app.
- `app/frame_capture.py` -- background thread capture with `on_frame` callback
- `app/sliding_window.py` -- thread-safe ring buffer (deque, max 16 frames)
- `scripts/test_capture.py` -- standalone test for capture + window
OpenCV's `cv2.VideoCapture.read()` reads frames sequentially from video files -- each call returns the next frame regardless of elapsed time. At a 25 FPS source and 1 FPS capture, 5 reads in 5 seconds only covered the first 0.2 seconds of video.
Fix: for video files, skip `src_fps / CAPTURE_FPS` frames per interval using `grab()` (fast, no decode) before `read()` (decode only the target frame). This simulates real-time playback. Not needed for live sources (webcam), where `read()` always returns the latest frame.
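A minimal sketch of the skip logic (the real `app/frame_capture.py` also handles looping and live sources):

```python
import cv2

cap = cv2.VideoCapture("test_files/videos/test.mp4")
src_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
CAPTURE_FPS = 1.0
skip = max(int(src_fps / CAPTURE_FPS) - 1, 0)

while True:
    for _ in range(skip):      # grab() advances without decoding -- cheap
        if not cap.grab():
            break
    ok, frame = cap.read()     # decode only the frame we actually keep
    if not ok:
        break
    # ... convert BGR -> RGB, wrap in PIL, push to the sliding window
```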
```bash
python -m scripts.test_capture --source test_files/videos/test.mp4
```

- Test video: 9.6s, 25.1 FPS, 1264x720
- Auto-detected video duration, captured for full length
- 10 frames captured at 1 FPS, evenly spread across the video
- All frames saved to `test_files/videos/capture_test/` for visual verification
- Frame quality matches source video (no additional compression)
- Sliding window correctly held all frames with push/get working across threads
Implemented together with Step 3. Simple `collections.deque` with `maxlen=WINDOW_SIZE` (default 16). Thread-safe via `threading.Lock`. Capture thread pushes, inference loop reads. Old frames auto-evicted.
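A minimal sketch of the structure (names assumed to match `app/sliding_window.py`):

```python
from collections import deque
from threading import Lock

class SlidingWindow:
    """Thread-safe ring buffer: capture thread pushes, inference loop reads."""

    def __init__(self, maxlen: int = 16):
        self._frames = deque(maxlen=maxlen)   # oldest frames auto-evicted
        self._lock = Lock()

    def push(self, frame) -> None:
        with self._lock:
            self._frames.append(frame)

    def get(self) -> list:
        with self._lock:
            return list(self._frames)   # snapshot; safe to iterate outside the lock
```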
No separate test needed -- validated as part of the capture test above.
- `app/monitor_loop.py` -- async orchestrator with IDLE/ACTIVE modes
- `scripts/test_monitor.py` -- standalone end-to-end test (model + capture + loop)
The monitor loop is the core orchestrator that ties frame capture, sliding window, and model server together. It runs as an async task and periodically grabs frames from the sliding window to run inference.
Two modes:
- IDLE -- frames captured and buffered, no inference (waiting for user instruction)
- ACTIVE -- periodic inference with the current instruction every `INFERENCE_INTERVAL` seconds
Key patterns:
- `asyncio.Event` for immediate cycle trigger on instruction change
- Pub/sub pattern for streaming text chunks to multiple consumers (SSE, WebSocket, test scripts)
- `loop.run_in_executor()` for non-blocking model inference (blocking call in thread pool; see the sketch after this list)
- `loop.call_soon_threadsafe()` to push chunks from the inference thread to an async queue
- `None` sentinel in the queue marks the end of an inference cycle
- `_started` event + `wait_started()` to synchronize consumers with the loop startup
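How these patterns fit together in one cycle, sketched (`infer` is the blocking entry point from `app/model_server.py`; `_publish` is the pub/sub fan-out described below):

```python
import asyncio

async def _inference_cycle(self, frames, instruction):
    loop = asyncio.get_running_loop()

    def on_chunk(text: str) -> None:
        # Called from the inference thread -- hop back onto the event loop.
        loop.call_soon_threadsafe(self._publish, text)

    # Blocking model call runs in the default thread pool executor.
    await loop.run_in_executor(None, lambda: infer(frames, instruction, on_chunk))

    self._publish(None)   # sentinel: this inference cycle is complete
```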
`asyncio.create_task(monitor.run())` schedules the task but doesn't execute it until the current coroutine yields. This caused two problems:

- `stream()` checked `_running` (still `False`) and exited immediately -- no output
- `stop()` called before `run()` started, then `run()` overrode `_running` back to `True` -- infinite loop

Fix: added a `_started` asyncio.Event that `run()` sets when it begins, and `wait_started()` for consumers to await. Added a `_stop_requested` flag to prevent `run()` from starting after `stop()`.
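The resulting startup handshake, sketched (attribute names taken from the description above; loop body elided):

```python
import asyncio

class MonitorLoop:
    def __init__(self) -> None:
        self._started = asyncio.Event()
        self._stop_requested = False
        self._running = False

    async def run(self) -> None:
        if self._stop_requested:   # stop() beat us to it -- don't start at all
            return
        self._running = True
        self._started.set()        # consumers blocked in wait_started() proceed
        while self._running:
            ...                    # capture/inference cycle

    async def wait_started(self) -> None:
        await self._started.wait()

    def stop(self) -> None:
        self._stop_requested = True
        self._running = False
```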
```bash
python -m scripts.test_monitor --source test_files/videos/test.mp4 --cycles 2
```

Test video: 9.6s animated scene of a cat protesting with police dogs (loops automatically).
| Metric | Cycle 1 | Cycle 2 | Cycle 3 (new instruction) |
|---|---|---|---|
| Instruction | "describe what you see" | "describe what you see" | "list only the colors" |
| Inference time | 7.5s | 9.3s | 1.7s |
| Output | Detailed scene description (~200 words) | Different detailed description | "Orange, brown, black, white, yellow, gray" |
Observations:
- Model correctly identified scene contents across 8 frames: cat on crate, "MI-AUW! NOW!" sign, police dogs, urban setting, lighting
- Each cycle produces different wording for the same scene (non-deterministic with `do_sample=True`)
- Instruction change works: "list only the colors" gave a 6-word response in 1.7s vs 7-9s for detailed descriptions
- Short instructions = short answers = fast cycles. This is useful for tuning `INFERENCE_INTERVAL` later
- Model load: 3.9s (warm, model already cached from previous tests)
- Frame accumulation: ~7s to buffer 8 frames at 1 FPS before first inference
- Video file looping works correctly -- capture continued beyond the 9.6s video duration
After initial testing, refactored the monitor loop from a single `asyncio.Queue` to a pub/sub pattern (`subscribe()`/`unsubscribe()`/`_publish()`). Each consumer (SSE connection, WebSocket client, test script) gets its own independent queue via `subscribe()`. This makes it trivial to add new transport types later (WebSocket for TTS/STT, etc.) without changing the monitor loop itself.
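The pub/sub core, sketched (method names from the description above):

```python
import asyncio

class MonitorLoop:
    def __init__(self) -> None:
        self._subscribers: set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self._subscribers.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self._subscribers.discard(q)

    def _publish(self, item) -> None:
        # Fan out to every consumer's private queue; put_nowait never
        # blocks, so a slow consumer can't stall the loop.
        for q in self._subscribers:
            q.put_nowait(item)
```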
Retested after refactor -- same behavior, no regressions.
- `app/main.py` -- FastAPI entry point with all endpoints and lifespan management
Thin HTTP layer over the existing components. The server loads the model on startup via FastAPI's lifespan context manager, then exposes the pipeline through REST endpoints and SSE.
Endpoints:
| Method | Path | Purpose |
|---|---|---|
| GET | `/` | Serve web UI (`static/index.html`) or placeholder |
| GET | `/api/status` | Model loaded? Capture running? Mode? Instruction? Frame count? |
| POST | `/api/start` | Start capture: `{"source": "test_files/videos/test.mp4"}` |
| POST | `/api/stop` | Stop capture, clear instruction |
| POST | `/api/instruction` | Set/change instruction: `{"instruction": "describe what you see"}` |
| GET | `/api/stream` | SSE endpoint -- each connection gets its own subscriber |
| GET | `/api/frame` | Latest frame as JPEG (quality 80) |
Key patterns:
- FastAPI lifespan for startup (model load, monitor loop start) and shutdown (cleanup)
- SSE via `StreamingResponse` with `text/event-stream` -- no extra dependencies (sketched below)
- Each SSE connection subscribes to the monitor loop's pub/sub independently
- SSE keepalive comments every 15s to prevent proxy/browser timeout
- Typed SSE events: `data:` for text chunks, `event: cycle_end` for cycle boundaries
- Frame endpoint converts PIL Image to JPEG in-memory via `io.BytesIO`
- Pydantic models for request/response validation
- Server port: 8199 (configurable via `SERVER_PORT` env var)
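A condensed sketch of the SSE endpoint (`app` and `monitor` are the assumed module-level objects; the real version also emits the 15s keepalive comments):

```python
from fastapi.responses import StreamingResponse

@app.get("/api/stream")
async def stream():
    q = monitor.subscribe()        # private queue from the pub/sub

    async def gen():
        try:
            while True:
                chunk = await q.get()
                if chunk is None:  # cycle-end sentinel from the monitor loop
                    yield "event: cycle_end\ndata: \n\n"
                else:
                    yield f"data: {chunk}\n\n"
        finally:
            monitor.unsubscribe(q)  # clean up when the client disconnects

    return StreamingResponse(gen(), media_type="text/event-stream")
```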
Tested all endpoints with curl against the running server:
```bash
python -m app.main   # starts server on port 8199
```

| Endpoint | Result |
|---|---|
| `GET /api/status` | JSON response with all fields correct (IDLE, no capture, 0 frames) |
| `POST /api/start` | Capture started on test video, status updated to `capture_running=true` |
| `GET /api/status` (after 3s) | 14 frames buffered, still IDLE (no instruction yet) |
| `GET /api/frame` | JPEG 1264x720, valid image |
| `POST /api/instruction` | Instruction set, monitor switched to ACTIVE, inference started |
| `GET /api/stream` | SSE chunks streaming, 3 full cycles received with `cycle_end` events |
| `POST /api/stop` | Capture stopped, instruction cleared, back to IDLE |
All endpoints working correctly. SSE streaming delivers text chunks in real-time with proper event formatting.
- `app/static/index.html` -- single-file vanilla HTML/JS/CSS, no build step
Split-panel layout: video preview on the left (polling `/api/frame`), AI commentary on the right (SSE `/api/stream`). Controls at the bottom: source input, start/stop buttons, instruction input.
Key features:
- Frame polling via `new Image()` at 500ms intervals (avoids broken-image flicker)
- SSE with `EventSource` -- auto-reconnect built in
EventSource-- auto-reconnect built in - Cycle metadata displayed per commentary block (frame IDs, capture timestamps, inference time, latency)
- Source input accepts file paths, device IDs, and stream URLs (RTSP, HTTP)
- Status badge reflects IDLE/ACTIVE mode
Opened http://localhost:8199 in browser. Verified:
- Video frames appear and update in real-time
- Start/stop/instruction controls work
- Commentary streams in with per-cycle metadata
- Instruction change mid-stream works correctly
Full pipeline tested: server startup → capture start → instruction set → streaming commentary → instruction change → stop.
Test with short video (9.6s cat scene, looping):
- Commentary: short, relevant, non-repeating
- Inference: ~1.2-2.3s per cycle
- Latency: ~8.6-9.4s (dominated by 8 frames × 1 FPS = 7s frame age)
Test with longer video (animated series, multiple scenes):
- Model correctly tracked scene changes: characters, actions, setting transitions
- Commentary stayed focused on new/changed elements
- Inference: ~1.3-1.7s per cycle
- Latency: ~8.3-9.0s
1. Commentator-style prompting: replaced the open-ended "describe what you see" with a system prompt that enforces 1-2 sentence responses, change-only focus, and no repetition. Moved to `config.py` as `COMMENTATOR_PROMPT` for easy customization.
2. Context carry-over: the previous response is included in the next prompt ("Your last comment was: ...") so the model avoids repeating itself. Reset on instruction change.
3. Scene change detection: before each inference cycle, the newest frame is compared to the previous cycle's frame using mean pixel difference on 64x64 thumbnails. If the difference is below `CHANGE_THRESHOLD` (default 5.0), the cycle is skipped entirely. Saves GPU time on static scenes (see the sketch after this list).
4. Centralized configuration: all hardcoded values moved to `config.py` with documentation. Every parameter is overridable via environment variable. Single source of truth (pattern sketched after this list).
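A sketch of the change detector (grayscale conversion is an assumption; the item above only specifies mean pixel difference on 64x64 thumbnails):

```python
import numpy as np
from PIL import Image

CHANGE_THRESHOLD = 5.0   # mean absolute pixel difference; below this, skip the cycle

def scene_changed(prev: Image.Image, curr: Image.Image) -> bool:
    a = np.asarray(prev.convert("L").resize((64, 64)), dtype=np.float32)
    b = np.asarray(curr.convert("L").resize((64, 64)), dtype=np.float32)
    return float(np.abs(a - b).mean()) >= CHANGE_THRESHOLD
```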
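And the env-override pattern used throughout `config.py` (defaults taken from this log where stated, assumed otherwise):

```python
import os

# Each setting: hardcoded default, overridable via environment variable.
CAPTURE_FPS = float(os.environ.get("CAPTURE_FPS", 1.0))
FRAMES_PER_INFERENCE = int(os.environ.get("FRAMES_PER_INFERENCE", 8))
MAX_NEW_TOKENS = int(os.environ.get("MAX_NEW_TOKENS", 512))
INFERENCE_INTERVAL = float(os.environ.get("INFERENCE_INTERVAL", 5.0))
CHANGE_THRESHOLD = float(os.environ.get("CHANGE_THRESHOLD", 5.0))
SERVER_PORT = int(os.environ.get("SERVER_PORT", 8199))
```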
| Component | Time | Notes |
|---|---|---|
| Frame age (oldest of 8 at 1 FPS) | ~7s | Reduce with fewer frames or higher FPS |
| Inference | ~1.3-2.3s | Depends on response length |
| Cycle interval wait | ~0-5s | Time until next cycle starts |
| Total end-to-end | ~8-11s | From oldest frame capture to response complete |
```bash
# Lower latency: fewer frames, higher capture rate
FRAMES_PER_INFERENCE=4 CAPTURE_FPS=2 python -m app.main

# Shorter responses
MAX_NEW_TOKENS=256 python -m app.main

# More aggressive change detection (skip more cycles)
CHANGE_THRESHOLD=10 python -m app.main
```