School of Computer Science, Hangzhou Dianzi University, China
Yufei Yin, Qianke Meng, Minghao Chen, Jiajun Ding, Zhenwei Shao, Zhou Yu*
This repository contains the official implementation of the CVPR 2026 paper **VideoARM**, an agentic framework that progressively localizes, interprets, and abstracts video evidence in an adaptive observe–think–act–memorize loop. Extensive experiments demonstrate that VideoARM maintains strong performance while significantly reducing token consumption. Our official skill implementation is also available on both GitHub and ClawHub.
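The observe–think–act–memorize loop can be pictured roughly as follows. This is a purely illustrative sketch, not the official implementation: `observe`, `think`, and the memory structure are placeholders standing in for VideoARM's real tools (Clip Analyzer, Scene Snapper, Audio Transcriber) and hierarchical memory.

```python
from typing import List, Optional

# Illustrative stubs only; the actual VideoARM tools are far richer.
def observe(video: str, memory: List[str]) -> str:
    """Localize a piece of evidence (placeholder)."""
    return f"clip-{len(memory)}"

def think(question: str, observation: str, memory: List[str]) -> Optional[str]:
    """Interpret the evidence; pretend we can answer after enough is memorized."""
    return "answer" if len(memory) >= 2 else None

def agent_loop(video: str, question: str, max_iterations: int = 10) -> Optional[str]:
    memory: List[str] = []
    for _ in range(max_iterations):
        obs = observe(video, memory)            # observe: localize evidence
        answer = think(question, obs, memory)   # think: interpret it
        if answer is not None:                  # act: answer once confident
            return answer
        memory.append(obs)                      # memorize: abstract into memory
    return None
```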
- [2026.04.09] Our official implementation is released.
- [2026.03.25] Our official skill implementation is available on both GitHub and ClawHub.
- [2026.02.21] Our paper is accepted at CVPR 2026 🎉
- [2025.12.13] Our paper is released on arXiv.
```bash
pip install -e .
```

Or install the dependencies directly:

```bash
pip install opencv-python openai requests python-dotenv numpy
```

Copy `.env.example` to `.env` and fill in your OpenAI API key:

```
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1
```
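Before running, it can help to verify the key actually loaded. The sketch below is a hypothetical pre-flight check, not part of the repository; only the variable names match `.env.example` above.

```python
import os

def check_config() -> str:
    """Hypothetical helper: fail fast if OPENAI_API_KEY is missing or malformed,
    and return the base URL (defaulting to the public OpenAI endpoint)."""
    key = os.getenv("OPENAI_API_KEY")
    if not key or not key.startswith("sk-"):
        raise RuntimeError("Set OPENAI_API_KEY in .env before running VideoARM")
    return os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1")
```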
```bash
# Open-ended question
python main.py --video path/to/video.mp4 --question "What happens in this video?"

# Multiple-choice (letter A-D, default)
python main.py --video video.mp4 --question "A. ... B. ... C. ... D. ..." --multiple-choice

# Multiple-choice (number 0-4)
python main.py --video video.mp4 --question "..." --multiple-choice --choice-format number

# Override the controller model
python main.py --video video.mp4 --question "..." --model gpt-4o

# Run without saving the result trace
python main.py --video video.mp4 --question "..." --no-save
```

| Environment variable | Default | Description |
|---|---|---|
| `VIDEOARM_MODEL_CONTROLLER` | `o3` | Reasoning controller |
| `VIDEOARM_MODEL_CLIP_ANALYZER` | `gpt-4.1` | Clip Analyzer + Scene Snapper |
| `VIDEOARM_MODEL_AUDIO_TRANSCRIBER` | `whisper-1` | Audio Transcriber |
| Environment variable | Default | Description |
|---|---|---|
| `VIDEOARM_MAX_ITERATIONS` | `10` | Step budget N |
| `VIDEOARM_MAX_FRAMES_PER_TOOL` | `150` | Max frames passed per tool call |
| `VIDEOARM_FRAME_ANALYSIS_MAX_FRAMES` | `50` | Frames sampled by the Clip Analyzer |
| `VIDEOARM_AUDIO_MAX_FRAMES` | `15000` | Max frames for audio extraction |
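Each of these tunables has a documented default that applies when the variable is unset. A minimal sketch of how such defaults might be read (`read_int` is a hypothetical helper, not the repository's actual config code):

```python
import os

def read_int(name: str, default: int) -> int:
    """Return the integer value of an environment variable,
    falling back to the documented default when it is unset."""
    raw = os.getenv(name)
    return int(raw) if raw is not None else default

# Defaults mirror the table above.
MAX_ITERATIONS = read_int("VIDEOARM_MAX_ITERATIONS", 10)
MAX_FRAMES_PER_TOOL = read_int("VIDEOARM_MAX_FRAMES_PER_TOOL", 150)
```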
To route different tools to different API endpoints:
```
VIDEOARM_API_KEY_CONTROLLER=sk-...
VIDEOARM_BASE_URL_CONTROLLER=https://...
VIDEOARM_API_KEY_CLIP_ANALYZER=sk-...
VIDEOARM_BASE_URL_CLIP_ANALYZER=https://...
```
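The natural resolution order for such overrides is per-tool variable first, then the generic `OPENAI_*` value. The sketch below illustrates that fallback; `resolve_endpoint` is a hypothetical helper, not the repository's actual routing code:

```python
import os
from typing import Tuple

def resolve_endpoint(tool: str) -> Tuple[str, str]:
    """Return (api_key, base_url) for a tool name such as "CONTROLLER" or
    "CLIP_ANALYZER", preferring the per-tool override over the generic value."""
    key = os.getenv(f"VIDEOARM_API_KEY_{tool}",
                    os.getenv("OPENAI_API_KEY", ""))
    url = os.getenv(f"VIDEOARM_BASE_URL_{tool}",
                    os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"))
    return key, url
```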
See `.env.example` for the full list of options.
If you find our work useful, please consider citing:
```bibtex
@inproceedings{yin2026videoarm,
  title={VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding},
  author={Yin, Yufei and Meng, Qianke and Chen, Minghao and Ding, Jiajun and Shao, Zhenwei and Yu, Zhou},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```