VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding

School of Computer Science, Hangzhou Dianzi University, China

Yufei Yin, Qianke Meng, Minghao Chen, Jiajun Ding, Zhenwei Shao, Zhou Yu*


This repository contains the official implementation of the CVPR 2026 paper VideoARM, a method that progressively localizes, interprets, and abstracts evidence in an adaptive observe–think–act–memorize loop. Extensive experiments demonstrate that VideoARM maintains strong performance while significantly reducing token consumption. Our official skill implementation is also available on both GitHub and ClawHub.

Figure 2: Overview

News

  • [2026.04.09] Our official implementation is available.

  • [2026.03.25] Our official skill implementation is available on both GitHub and ClawHub.

  • [2026.02.21] Our paper is accepted to CVPR 2026 🎉

  • [2025.12.13] Our paper is released on arXiv.

Installation

pip install -e .

Or install dependencies directly:

pip install opencv-python openai requests python-dotenv numpy

API keys

Copy .env.example to .env and fill in your OpenAI API key:

OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1

Usage

Command line

# Open-ended question
python main.py --video path/to/video.mp4 --question "What happens in this video?"

# Multiple-choice (letter A-D, default)
python main.py --video video.mp4 --question "A. ... B. ... C. ... D. ..." --multiple-choice

# Multiple-choice (number 0-4)
python main.py --video video.mp4 --question "..." --multiple-choice --choice-format number

# Override the controller model
python main.py --video video.mp4 --question "..." --model gpt-4o

# Run without saving the result trace
python main.py --video video.mp4 --question "..." --no-save
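For batch runs, the invocations above can be assembled programmatically. Below is a minimal sketch (the `build_videoarm_cmd` helper is ours for illustration, not part of the repo); it only reuses the flags documented above:

```python
import sys

def build_videoarm_cmd(video, question, multiple_choice=False,
                       choice_format=None, model=None, save=True):
    """Assemble a main.py invocation from the documented CLI flags."""
    cmd = [sys.executable, "main.py", "--video", video, "--question", question]
    if multiple_choice:
        cmd.append("--multiple-choice")
    if choice_format:           # e.g. "number" for 0-4 answers
        cmd += ["--choice-format", choice_format]
    if model:                   # override the controller model
        cmd += ["--model", model]
    if not save:                # skip saving the result trace
        cmd.append("--no-save")
    return cmd
```

Pass the result to `subprocess.run(..., check=True)` from the repository root to execute each query.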

Configuration

Model selection

| Environment variable | Default | Description |
| --- | --- | --- |
| `VIDEOARM_MODEL_CONTROLLER` | `o3` | Reasoning controller |
| `VIDEOARM_MODEL_CLIP_ANALYZER` | `gpt-4.1` | Clip Analyzer + Scene Snapper |
| `VIDEOARM_MODEL_AUDIO_TRANSCRIBER` | `whisper-1` | Audio Transcriber |

Pipeline parameters

| Environment variable | Default | Description |
| --- | --- | --- |
| `VIDEOARM_MAX_ITERATIONS` | 10 | Step budget N |
| `VIDEOARM_MAX_FRAMES_PER_TOOL` | 150 | Max frames passed per tool call |
| `VIDEOARM_FRAME_ANALYSIS_MAX_FRAMES` | 50 | Frames sampled by Clip Analyzer |
| `VIDEOARM_AUDIO_MAX_FRAMES` | 15000 | Max frames for audio extraction |

Per-component API overrides

To route different tools to different API endpoints:

VIDEOARM_API_KEY_CONTROLLER=sk-...
VIDEOARM_BASE_URL_CONTROLLER=https://...

VIDEOARM_API_KEY_CLIP_ANALYZER=sk-...
VIDEOARM_BASE_URL_CLIP_ANALYZER=https://...

See .env.example for the full list of options.
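The override variables follow a `VIDEOARM_<SETTING>_<COMPONENT>` naming scheme. One plausible resolution order is per-component value first, then the global `OPENAI_*` setting; the sketch below assumes that precedence (check `.env.example` and the config code for the authoritative behavior):

```python
import os

def component_credentials(component):
    """Resolve the API key and base URL for one component,
    falling back to the global OPENAI_* settings when no
    per-component override is present (assumed precedence)."""
    key = (os.environ.get(f"VIDEOARM_API_KEY_{component}")
           or os.environ.get("OPENAI_API_KEY"))
    base = (os.environ.get(f"VIDEOARM_BASE_URL_{component}")
            or os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1"))
    return key, base
```

For example, `component_credentials("CONTROLLER")` would pick up `VIDEOARM_API_KEY_CONTROLLER` when set and the global key otherwise.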

Citation

If you find our work useful, please consider citing:

@inproceedings{yin2026videoarm,
  title={VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding},
  author={Yin, Yufei and Meng, Qianke and Chen, Minghao and Ding, Jiajun and Shao, Zhenwei and Yu, Zhou},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
