School of Computer Science, Hangzhou Dianzi University, China
Yufei Yin, Qianke Meng, Minghao Chen, Jiajun Ding, Zhenwei Shao, Zhou Yu*
This repository contains the official implementation of the CVPR 2026 paper **VideoARM**, an agentic framework that progressively localizes, interprets, and abstracts video evidence in an adaptive observe–think–act–memorize loop. Extensive experiments demonstrate that VideoARM maintains strong performance while significantly reducing token consumption. Our official skill implementation is also available on both GitHub and ClawHub.
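The observe–think–act–memorize loop can be pictured roughly as follows. This is a purely illustrative sketch, not the official implementation: `observe`, `think`, and the memory structure are placeholders standing in for VideoARM's real tools (Clip Analyzer, Scene Snapper, Audio Transcriber) and hierarchical memory.

```python
from typing import List, Optional

# Illustrative stubs only; the actual VideoARM tools are far richer.
def observe(video: str, memory: List[str]) -> str:
    """Localize a piece of evidence (placeholder)."""
    return f"clip-{len(memory)}"

def think(question: str, observation: str, memory: List[str]) -> Optional[str]:
    """Interpret the evidence; pretend we can answer after enough is memorized."""
    return "answer" if len(memory) >= 2 else None

def agent_loop(video: str, question: str, max_iterations: int = 10) -> Optional[str]:
    memory: List[str] = []
    for _ in range(max_iterations):
        obs = observe(video, memory)            # observe: localize evidence
        answer = think(question, obs, memory)   # think: interpret it
        if answer is not None:                  # act: answer once confident
            return answer
        memory.append(obs)                      # memorize: abstract into memory
    return None
```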
- [2026.04.09] Our official implementation is released.
- [2026.03.25] Our official skill implementation is available on both GitHub and ClawHub.
- [2026.02.21] Our paper is accepted at CVPR 2026 🎉
- [2025.12.13] Our paper is released on arXiv.
```bash
pip install -e .
```

Or install the dependencies directly:

```bash
pip install opencv-python openai requests python-dotenv numpy
```

Copy `.env.example` to `.env` and fill in your OpenAI API key:

```
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1
```
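Before running, it can help to verify the key actually loaded. The sketch below is a hypothetical pre-flight check, not part of the repository; only the variable names match `.env.example` above.

```python
import os

def check_config() -> str:
    """Hypothetical helper: fail fast if OPENAI_API_KEY is missing or malformed,
    and return the base URL (defaulting to the public OpenAI endpoint)."""
    key = os.getenv("OPENAI_API_KEY")
    if not key or not key.startswith("sk-"):
        raise RuntimeError("Set OPENAI_API_KEY in .env before running VideoARM")
    return os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1")
```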
```bash
# Open-ended question
python main.py --video path/to/video.mp4 --question "What happens in this video?"

# Multiple-choice (letter A-D, default)
python main.py --video video.mp4 --question "A. ... B. ... C. ... D. ..." --multiple-choice

# Multiple-choice (number 0-4)
python main.py --video video.mp4 --question "..." --multiple-choice --choice-format number

# Override the controller model
python main.py --video video.mp4 --question "..." --model gpt-4o

# Run without saving the result trace
python main.py --video video.mp4 --question "..." --no-save
```

| Environment variable | Default | Description |
|---|---|---|
| `VIDEOARM_MODEL_CONTROLLER` | `o3` | Reasoning controller |
| `VIDEOARM_MODEL_CLIP_ANALYZER` | `gpt-4.1` | Clip Analyzer + Scene Snapper |
| `VIDEOARM_MODEL_AUDIO_TRANSCRIBER` | `whisper-1` | Audio Transcriber |
| Environment variable | Default | Description |
|---|---|---|
| `VIDEOARM_MAX_ITERATIONS` | `10` | Step budget N |
| `VIDEOARM_MAX_FRAMES_PER_TOOL` | `150` | Max frames passed per tool call |
| `VIDEOARM_FRAME_ANALYSIS_MAX_FRAMES` | `50` | Frames sampled by the Clip Analyzer |
| `VIDEOARM_AUDIO_MAX_FRAMES` | `15000` | Max frames for audio extraction |
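Each of these tunables has a documented default that applies when the variable is unset. A minimal sketch of how such defaults might be read (`read_int` is a hypothetical helper, not the repository's actual config code):

```python
import os

def read_int(name: str, default: int) -> int:
    """Return the integer value of an environment variable,
    falling back to the documented default when it is unset."""
    raw = os.getenv(name)
    return int(raw) if raw is not None else default

# Defaults mirror the table above.
MAX_ITERATIONS = read_int("VIDEOARM_MAX_ITERATIONS", 10)
MAX_FRAMES_PER_TOOL = read_int("VIDEOARM_MAX_FRAMES_PER_TOOL", 150)
```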
To route different tools to different API endpoints:
```
VIDEOARM_API_KEY_CONTROLLER=sk-...
VIDEOARM_BASE_URL_CONTROLLER=https://...
VIDEOARM_API_KEY_CLIP_ANALYZER=sk-...
VIDEOARM_BASE_URL_CLIP_ANALYZER=https://...
```
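The natural resolution order for such overrides is per-tool variable first, then the generic `OPENAI_*` value. The sketch below illustrates that fallback; `resolve_endpoint` is a hypothetical helper, not the repository's actual routing code:

```python
import os
from typing import Tuple

def resolve_endpoint(tool: str) -> Tuple[str, str]:
    """Return (api_key, base_url) for a tool name such as "CONTROLLER" or
    "CLIP_ANALYZER", preferring the per-tool override over the generic value."""
    key = os.getenv(f"VIDEOARM_API_KEY_{tool}",
                    os.getenv("OPENAI_API_KEY", ""))
    url = os.getenv(f"VIDEOARM_BASE_URL_{tool}",
                    os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"))
    return key, url
```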
See `.env.example` for the full list of options.
If you find our work useful, please consider citing:
```bibtex
@inproceedings{yin2026videoarm,
  title={VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding},
  author={Yin, Yufei and Meng, Qianke and Chen, Minghao and Ding, Jiajun and Shao, Zhenwei and Yu, Zhou},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```