SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

A comprehensive pipeline for creating and evaluating video question answering (VQA) datasets using agentic AI workflows. This repository implements an end-to-end system for curating real-world videos, generating multi-task VQA annotations, and evaluating vision-language models on diverse scenarios.

Overview

SONIC-O1 provides a systematic approach to building high-quality VQA datasets across 13 real-world topics, including healthcare consultations, job interviews, and emergency scenarios. Using LLMs such as Gemini and GPT-4, the pipeline generates three types of VQA tasks:

Task 1 — Summarization: Short + detailed summaries with temporal timelines
Task 2 — Multiple Choice (MCQ): Questions with plausible distractors
Task 3 — Temporal Localization: Finding specific moments in videos

Pipeline Architecture

The system is organized into five sequential stages:

01_data_curation → 02_caption_generation → 03_demographics_annotation → 04_vqa_generation → 05_evaluation_inference

Each stage is self-contained with its own configuration, scripts, and documentation.

Repository Structure

This repository contains the pipeline code only. The dataset and annotations are hosted separately on Hugging Face as the dataset repository vector-institute/sonic-o1 (see Dataset Download).

Important: After cloning, you'll have a nested structure: sonic-o1/sonic-o1/

First sonic-o1/ — the git repository root
Second sonic-o1/ — the working directory containing all pipeline code

sonic-o1/                          # Git repository root
└── sonic-o1/                      # Working directory (cd here to run commands)
    ├── 01_data_curation/          # YouTube video collection and filtering
    ├── 02_caption_generation/     # WhisperX-based transcription
    ├── 03_demographics_annotation/# Character demographics extraction
    ├── 04_vqa_generation/         # Multi-task VQA generation
    ├── 05_evaluation_inference/   # Model evaluation framework
    ├── dataset/                   # Downloaded from Hugging Face
    └── vqa/                       # Downloaded from Hugging Face

Not included in this repo (download from Hugging Face):

dataset/ — curated videos, audio, captions, and metadata
vqa/ — generated VQA annotations (3 tasks × topics)
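
A quick way to confirm that you are in the right working directory, and that the Hugging Face download landed where the pipeline expects it, is to check for the directories shown in the tree above. The following is a minimal sketch, not part of the repository; the helper name `missing_dirs` is hypothetical.

```python
from pathlib import Path

# Directory names copied from the repository tree above.
STAGE_DIRS = [
    "01_data_curation",
    "02_caption_generation",
    "03_demographics_annotation",
    "04_vqa_generation",
    "05_evaluation_inference",
]
DATA_DIRS = ["dataset", "vqa"]


def missing_dirs(workdir):
    """Return the expected sub-directories that are absent from workdir."""
    root = Path(workdir)
    return [name for name in STAGE_DIRS + DATA_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Layout looks complete.")
```

If only dataset and vqa are reported missing, the code is in place and the Hugging Face download has not been run yet.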

Quick Start

Prerequisites

Python 3.8+
GPU with CUDA support (recommended for caption generation and inference)
API keys/tokens (only for the stages you plan to run):
- YouTube Data API v3 (Stage 01; only if collecting new videos)
- Google Gemini API / OpenAI API (Stages 03–05 depending on backend)
- Hugging Face token (Stage 05 model downloads)

Installation

# Clone the repository
git clone https://github.com/VectorInstitute/sonic-o1.git

# Navigate to working directory (note the nested structure)
cd sonic-o1/sonic-o1

# Create and activate a virtual environment first, so that all packages
# (including huggingface_hub) are installed into it
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# (Recommended) Download dataset + VQA annotations from Hugging Face
pip install huggingface_hub
huggingface-cli download vector-institute/sonic-o1 --repo-type dataset --local-dir ./

# Install base dependencies
pip install -r requirements_venv.txt
# Or using pyproject.toml:
pip install -e .

# Note: Each stage (01-05) may have additional dependencies.
# Stage 05 has model-specific requirements in 05_evaluation_inference/models_requirements/

Environment Setup (API Keys)

Create a .env file in each stage directory only if you plan to run that stage:

# 01_data_curation/.env (only if collecting new videos)
YOUTUBE_API_KEY=your_youtube_api_key

# 03_demographics_annotation/.env (only if annotating new videos)
GEMINI_API_KEY=your_gemini_api_key

# 04_vqa_generation/.env (only if generating new VQA tasks)
OPENAI_API_KEY=your_openai_api_key
GEMINI_API_KEY=your_gemini_api_key

# 05_evaluation_inference/.env (for model evaluation)
HF_TOKEN=your_huggingface_token
GEMINI_API_KEY=your_gemini_api_key  # if using Gemini models
OPENAI_API_KEY=your_openai_api_key  # if using OpenAI models

Note: If you're only evaluating models using the pre-curated dataset, you typically only need Stage 05.
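
Before launching a stage, it can help to confirm that its keys are actually set. The sketch below parses a stage's .env file with the standard library only and reports which keys are missing. The key names come from the examples above; the helper names, and the choice to treat the Gemini and OpenAI keys as optional for Stage 05 and as both required for Stage 04, are assumptions for illustration. How the pipeline itself loads these files is not shown here.

```python
import os
from pathlib import Path

# Keys per stage, copied from the .env examples above (assumed mapping).
REQUIRED_KEYS = {
    "01_data_curation": ["YOUTUBE_API_KEY"],
    "03_demographics_annotation": ["GEMINI_API_KEY"],
    "04_vqa_generation": ["OPENAI_API_KEY", "GEMINI_API_KEY"],
    "05_evaluation_inference": ["HF_TOKEN"],
}


def parse_env_file(path):
    """Parse KEY=value lines, skipping blank lines and comments."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing "  # comment", as used in the examples above.
        value = value.split(" #", 1)[0].strip().strip('"').strip("'")
        values[key.strip()] = value
    return values


def missing_keys(stage_dir):
    """Return required keys found neither in <stage_dir>/.env nor in os.environ."""
    stage = Path(stage_dir)
    env_path = stage / ".env"
    found = parse_env_file(env_path) if env_path.is_file() else {}
    required = REQUIRED_KEYS.get(stage.name, [])
    return [key for key in required if not (found.get(key) or os.environ.get(key))]
```

For example, `missing_keys("05_evaluation_inference")` returns an empty list once HF_TOKEN is available from either source.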

Dataset Download

If you skipped the download step during installation, fetch the pre-curated dataset from Hugging Face before running the pipeline (run this from sonic-o1/sonic-o1):

pip install huggingface_hub
huggingface-cli download vector-institute/sonic-o1 --repo-type dataset --local-dir ./

The download includes:

dataset/videos/ (topic-organized video files)
dataset/audios/ (extracted audio)
dataset/captions/ (WhisperX transcriptions)
per-topic metadata JSON files
vqa/ directory (task annotations)

Important: Keep the directory names dataset/ and vqa/ exactly as downloaded.
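
After the download finishes, a per-topic file count is a cheap sanity check. The sketch below assumes one sub-folder per topic under dataset/videos/ (the listing above only says the videos are topic-organized, so the exact layout is an assumption), and the helper name is hypothetical.

```python
from pathlib import Path


def count_files_per_topic(dataset_root, subdir="videos"):
    """Map each topic folder under <dataset_root>/<subdir>/ to its file count."""
    base = Path(dataset_root) / subdir
    if not base.is_dir():
        raise FileNotFoundError(f"{base} not found; was the dataset downloaded here?")
    counts = {}
    for topic_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        # Count files at any depth, in case a topic nests its videos.
        counts[topic_dir.name] = sum(1 for f in topic_dir.rglob("*") if f.is_file())
    return counts
```

Calling `count_files_per_topic("dataset")` from the working directory returns a dictionary keyed by topic folder name; passing `subdir="audios"` or `subdir="captions"` applies the same check to the other folders.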

Pipeline Stages (what to run)

01 — Data Curation (optional)

Collects YouTube videos for the configured topics and filters them for quality.

cd 01_data_curation
python parse_topic.py --topics 01_Patient-Doctor_Consultations

See 01_data_curation/README.md for details.

02 — Caption Generation (optional)

Generates transcriptions using WhisperX with word-level timestamps.

cd 02_caption_generation
python whisper_captionGen.py --dataset-root ../dataset --model large-v2

See 02_caption_generation/README.md for installation and usage.

03 — Demographics Annotation (optional)

Extracts character demographics and interactions using vision-language models.

cd 03_demographics_annotation
python run_annotation.py --topics 01_Patient-Doctor_Consultations

See 03_demographics_annotation/README.md for details.

04 — VQA Generation (optional)

Generates VQA tasks using agentic workflows with Gemini/OpenAI backends.

cd 04_vqa_generation
python main.py --topics 1,2,3 --tasks summarization,mcq,temporal_localization

See 04_vqa_generation/README.md for configuration options.

05 — Evaluation & Inference (main entrypoint)

Evaluates vision-language models on the VQA tasks.

cd 05_evaluation_inference

python run_evaluation.py \
  --model videollama2 \
  --tasks t1,t2,t3 \
  --topics all \
  --dataset-path ../dataset \
  --vqa-path ../vqa

Supported models include: VideoLLaMA2, VITA, Gemini, GPT, Uni-MoE variants, and custom integrations.
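
When evaluating several models in a row, it may be convenient to build the command line programmatically. The sketch below only assembles the flags shown in the example above; `build_eval_command` is a hypothetical helper, and "videollama2" is the only model identifier confirmed here. Check 05_evaluation_inference/README.md for the identifiers of other models.

```python
import shlex
import sys


def build_eval_command(model, tasks=("t1", "t2", "t3"), topics="all",
                       dataset_path="../dataset", vqa_path="../vqa"):
    """Return the argv list for one run_evaluation.py invocation."""
    return [
        sys.executable, "run_evaluation.py",
        "--model", model,
        "--tasks", ",".join(tasks),
        "--topics", topics,
        "--dataset-path", dataset_path,
        "--vqa-path", vqa_path,
    ]


if __name__ == "__main__":
    # Extend this list with identifiers documented in the stage README.
    for model in ["videollama2"]:
        print(shlex.join(build_eval_command(model)))
        # To launch it from 05_evaluation_inference/ instead of printing:
        # import subprocess; subprocess.run(build_eval_command(model), check=True)
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems if a path contains spaces.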

Metrics:

Task 1: ROUGE-L, Judge-Score
Task 2: Accuracy
Task 3: Temporal IoU, Precision@K, MAE

See 05_evaluation_inference/README.md for model setup and metrics.
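
To make the Task 1 metric concrete: ROUGE-L scores a candidate summary by the longest common subsequence (LCS) of tokens it shares with a reference. The following is a minimal sketch of the F1 variant with plain whitespace tokenization. It illustrates the idea and is not the evaluation framework's implementation, which may use a dedicated library, stemming, or a different F-measure weighting.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two strings, using lowercase whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the LCS respects word order but allows gaps, a summary that covers the reference's content in the same sequence scores well even when it adds extra words between matches.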

Dataset Topics

Patient-Doctor Consultations
Job Interviews
Parent-Teacher Conferences
Customer Service Interactions
Courtroom Proceedings
Emergency Response Scenarios
Public Transportation Conflicts
Workplace Team Meetings
Housing/Apartment Tours
Restaurant Service Encounters
Mental Health Counseling
Community Town Halls
Olympics (Sports events)

Each topic contains 15–25 carefully curated videos with complete annotations.

Output Format Examples

Task 1 — Summarization

{
  "video_id": "abc123",
  "summary_short": ["• Bullet point 1", "• Bullet point 2"],
  "summary_detailed": "Comprehensive narrative...",
  "timeline": [
    {
      "start": "00:01:23",
      "end": "00:02:45",
      "title": "Section Title",
      "note": "Description of events"
    }
  ]
}
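
The timeline timestamps in this format are "HH:MM:SS" strings, so comparing or sorting segments requires converting them to seconds first. The sketch below does that and flags segments that end before they start or that overlap the previous one. Whether overlapping segments are actually disallowed in the released annotations is an assumption; the helper names are hypothetical.

```python
def hms_to_seconds(stamp):
    """Convert an "HH:MM:SS" string to integer seconds."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def timeline_problems(record):
    """Return human-readable problems found in a Task 1 record's timeline."""
    problems = []
    previous_end = 0
    for i, segment in enumerate(record.get("timeline", [])):
        start = hms_to_seconds(segment["start"])
        end = hms_to_seconds(segment["end"])
        if end <= start:
            problems.append(f"segment {i}: end is not after start")
        if start < previous_end:
            problems.append(f"segment {i}: overlaps the previous segment")
        previous_end = max(previous_end, end)
    return problems
```

For the example record above, `hms_to_seconds("00:01:23")` gives 83 and `timeline_problems` returns an empty list.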

Task 2 — MCQ

{
  "video_id": "abc123",
  "question": "...",
  "options": [
    "(A) ...",
    "(B) ...",
    "(C) ...",
    "(D) ...",
    "(E) Not enough evidence"
  ],
  "answer_index": 1,
  "answer_letter": "B",
  "rationale": "..."
}
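
In the example above, "answer_index": 1 pairs with "answer_letter": "B", which implies a zero-based index into options. The sketch below checks that consistency for a record and computes Task 2 accuracy from a list of predicted letters. The helper names are hypothetical, and the evaluation framework may parse model outputs differently.

```python
import string


def mcq_is_consistent(item):
    """Check that answer_index and answer_letter point at the same option."""
    index = item["answer_index"]
    if not 0 <= index < len(item["options"]):
        return False
    letter = string.ascii_uppercase[index]  # 0 -> "A", 1 -> "B", ...
    return (item["answer_letter"] == letter
            and item["options"][index].startswith(f"({letter})"))


def mcq_accuracy(items, predicted_letters):
    """Fraction of items whose predicted letter equals answer_letter."""
    if len(items) != len(predicted_letters):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        item["answer_letter"] == pred.strip().upper()
        for item, pred in zip(items, predicted_letters)
    )
    return correct / len(items)
```

Running the consistency check over the vqa/ files before scoring helps separate annotation problems from model errors.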

Task 3 — Temporal Localization

{
  "video_id": "abc123",
  "questions": [
    {
      "question_id": "001",
      "question": "After the speaker ...",
      "temporal_relation": "after",
      "anchor_event": "The speaker ..",
      "target_event": "The speaker states that he is a ...",
      "answer": { "start_s": 35.0, "end_s": 36.62 }
    }
  ]
}
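
The answer interval (start_s, end_s) is what the Task 3 metrics compare against a model's predicted interval. As a minimal sketch, temporal IoU is the overlap of the two intervals divided by their union. The boundary error below averages the absolute start and end differences, which is one plausible reading of "MAE"; the exact definition used by the evaluation framework is an assumption here.

```python
def temporal_iou(pred, gold):
    """Intersection over union of two {"start_s", "end_s"} intervals."""
    overlap = min(pred["end_s"], gold["end_s"]) - max(pred["start_s"], gold["start_s"])
    inter = max(0.0, overlap)
    union = (pred["end_s"] - pred["start_s"]) + (gold["end_s"] - gold["start_s"]) - inter
    return inter / union if union > 0 else 0.0


def boundary_mae(pred, gold):
    """Mean absolute error, in seconds, over the start and end boundaries."""
    start_err = abs(pred["start_s"] - gold["start_s"])
    end_err = abs(pred["end_s"] - gold["end_s"])
    return (start_err + end_err) / 2
```

For instance, a prediction of 0 to 10 s against a reference of 5 to 15 s overlaps for 5 s out of a 15 s union, giving an IoU of 1/3 and a boundary error of 5 s.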

Configuration

Each stage uses YAML configuration files:

01_data_curation/config.yaml — search + filtering parameters
02_caption_generation/config_whisper.yaml — transcription settings
03_demographics_annotation/config.yaml — LLM + annotation settings
04_vqa_generation/config/*.yaml — task-specific VQA generation
05_evaluation_inference/configs/*.yaml — model + metric settings

Citation

If you use this dataset or pipeline in your research, please cite:

@article{radwan2026sonico1,
  title={SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding},
  author={Radwan, Ahmed Y and Emmanouilidis, Christos and Tabassum, Hina and Pandya, Deval and Raza, Shaina},
  journal={arXiv preprint arXiv:2601.21666},
  year={2026}
}

License

This dataset is licensed under the Vector Institute License. The SONIC-O1 dataset may only be accessed and used by:

Academic entities for non-commercial academic research purposes
Vector Institute sponsors and partners

By accessing or using this dataset, you agree to be bound by the terms of the Vector Institute License.

For products or services built using this dataset, prominently display: "Built with Vector Institute SONIC-O1"

Acknowledgments

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

This research was funded by the European Union's Horizon Europe research and innovation programme under the AIXPERT project (Grant Agreement No. 101214389).

Troubleshooting

Common issues:

Disk quota exceeded: set cache directories to scratch space (see 02_caption_generation/README.md)
API rate limits: adjust rate_limit_delay in configs
CUDA OOM: use smaller models or reduce batch sizes
Missing dependencies: check individual stage README files
Dataset path issues: ensure you're in sonic-o1/sonic-o1 after cloning
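
For the disk-quota case, one common approach is to point the model caches at scratch space before any model library is imported. The sketch below sets HF_HOME, TORCH_HOME, and XDG_CACHE_HOME, which are the standard cache variables for Hugging Face libraries, PyTorch, and general XDG-aware tools. The helper name and the scratch path are placeholders, and 02_caption_generation/README.md remains the reference for what that stage expects.

```python
import os
from pathlib import Path


def redirect_caches(scratch_dir):
    """Create cache folders under scratch_dir and export the matching variables."""
    scratch = Path(scratch_dir)
    targets = {
        "HF_HOME": scratch / "huggingface",
        "TORCH_HOME": scratch / "torch",
        "XDG_CACHE_HOME": scratch / "cache",
    }
    for var, path in targets.items():
        path.mkdir(parents=True, exist_ok=True)
        os.environ[var] = str(path)
    return {var: str(path) for var, path in targets.items()}


# Call this before importing transformers, torch, or whisperx, because those
# libraries read the variables at import time, for example:
# redirect_caches("/scratch/your_user/sonic_o1_cache")  # placeholder path
```

Setting the same variables in your shell profile or job script has the same effect and also covers command-line tools such as huggingface-cli.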

Support

Open an issue on GitHub: https://github.com/VectorInstitute/sonic-o1/issues
Check individual stage README files for detailed troubleshooting
Review the stage-specific configuration files listed under Configuration
