MOSS-Video-Preview is a multimodal vision foundation model specifically engineered for real-time video understanding. Built upon the Llama-3.2-Vision architecture, it comprehensively extends the model's native video processing capabilities, delivering state-of-the-art real-time multimodal reasoning performance.
> [!IMPORTANT]
> **Project Note:**
> At this stage, this project serves as an exploratory endeavor, leveraging high-quality open-source datasets to validate the potential of the Cross-Attention architecture for native real-time video understanding. This is only the beginning; we are committed to a comprehensive scaling roadmap across three dimensions: Data Scaling, Parameter Scaling, and Context Scaling, with the goal of building more robust and general-purpose video intelligence.
>
> As we strive to build more robust and general-purpose video intelligence, we warmly welcome experts in Representation Learning, Model Compression, and Inference Acceleration to join our journey. Whether you are optimizing inference latency or exploring efficient architectures, we invite you to experiment and innovate on top of our framework. Let's push the boundaries of video intelligence and advance the open-source community together!
- **Image-Video Cross-Attention Architecture**: Transcending the limitations of conventional architectures, MOSS-Video-Preview leverages a native Cross-Attention mechanism to provide unified image-video understanding. This approach deeply decouples visual and linguistic features, facilitating seamless and continuous analysis of ultra-long temporal sequences.
- **Millisecond-Level Interaction & Dynamic Self-Correction**: The system supports seamless transitions between "Silence" and "Speak" modes. With enhanced contextual awareness, the model allows real-time interruptions to adjust or refine responses dynamically as video scenes evolve, delivering a truly responsive, full-duplex user experience.
- **Extreme Inference Performance & Kernel Optimization**: Leveraging deeply optimized Cross-Attention kernels and Flash Attention 2 acceleration on both CUDA and NPU platforms, MOSS-Video-Preview is specifically engineered for long-form video processing. It achieves ultra-low latency while significantly reducing memory overhead.
- **Fine-grained Data Synthesis Pipeline**: We have engineered a sophisticated data synthesis pipeline for real-time video understanding, powered by state-of-the-art multimodal LLMs. We are committed to open-sourcing these datasets in the near future to support the research community and collectively advance the frontier of real-time video perception.
- News
- Model Architecture
- Real-Time Inference Process
- Demo
- Training Stages & Data Composition
- Evaluation Results
- Streaming Inference Speed (Single-Setup Measurement)
- Quick Start
- Training & Fine-tuning
- Model Download
- Limitations & Future Outlook
- TODO List
- Citation
- Acknowledgement
- 2026/04/08: MOSS-VL is officially open-sourced! Released MOSS-VL-Base-0408 and MOSS-VL-Instruct-0408.
- 2026/03/04: MOSS-Video-Preview source code and architecture details released!
- 2025/10/18: Post-mortem on current issues; started the MOSS-VL project.
- 2025/10/08: Internal demo showcased within the lab and the school.
- 2025/09: moss-video-preview-realtime-sft ready.
- 2025/08: moss-video-preview-sft ready.
Built on a native Real-Time temporal architecture, MOSS-Video-Preview decouples visual perception and linguistic reasoning to minimize computational latency. This enables millisecond-level streaming performance, ensuring a highly responsive and fluid interactive experience for continuous video streams.
Figure 1: Overall architecture of MOSS-Video-Preview.
The core strength of MOSS-Video-Preview lies in its native real-time streaming capability, enabling continuous, low-latency processing of live video feeds.
Figure 2: Real-Time inference pipeline.
- **Asynchronous Real-Time Input**: Video frames are continuously injected at a stable frame rate for high-frequency real-time perception. The input process is non-blocking and fully decoupled from the text generation loop, ensuring uninterrupted visual tracking.
- **Long-range State Persistence**: Leveraging a Cross-Attention KV Cache and Temporal Positional Encoding, the model maintains robust contextual dependencies across continuous frames, ensuring coherent temporal understanding over extended sequences.
- **Ultra-Low Latency Streaming Response**: The model supports autoregressive generation simultaneously with the incoming video stream. By eliminating the need for full-clip buffering, it achieves "on-the-fly" reasoning and interaction with minimal end-to-end latency.
- **Cross-Modal Projector**: Featuring the proprietary `VideoMllamaTextCrossAttention` mechanism, this component utilizes bidirectional cross-attention to achieve highly efficient fusion and semantic alignment between temporal visual features and linguistic context.
- **Streaming Causal Decoding Module**: A specialized decoder for autoregressive generation based on dynamic visual inputs. It adapts dynamically, adjusting and refining generated content in real time based on the latest visual cues captured from the stream.
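To make the fusion step concrete, below is a toy, pure-Python sketch of single-head cross-attention in the spirit of the projector described above. It is illustrative only: the actual `VideoMllamaTextCrossAttention` module operates on learned projections with many heads, and its exact internals are not reproduced here.

```python
import math

def cross_attention(queries, keys, values):
    """Toy single-head cross-attention: text-side query vectors attend
    over visual key/value vectors (plain lists of floats).
    Returns one attended output vector per query."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Scaled dot-product scores against every visual key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

With one-hot values, the output is just the attention distribution itself, which makes the mechanism easy to inspect: a query aligned with the first key puts more weight on the first value.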
streaming_demo.mp4
video_demo.mp4
image_demo.mp4
MOSS-Video-Preview employs a three-stage progressive training strategy to evolve the model from basic modality alignment to complex real-time video reasoning.
| Stage | Core Objective | Trainable Parameters | Data Mixture (T / I / V) | Training Samples |
|---|---|---|---|---|
| PT-Stage 1 | Cross-modal Alignment | Vision Projector only | 0% / 79% / 21% | 15.1 M |
| PT-Stage 2 | Temporal & Long Video Perception | Full Parameters | 0% / 26% / 74% | 1.8 M |
| Offline SFT | Instruction Following & Reasoning | Full Parameters | 14% / 44% / 42% | 8.6 M |
| Real-Time SFT | Real-Time understanding and reasoning | Full Parameters | 11% / 29% / 60% | 836 K |
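As a sanity check on the table above, the per-modality sample counts implied by a T / I / V mixture follow from simple arithmetic. The helper below is a hypothetical illustration, not part of the training code:

```python
def mixture_counts(total, text_pct, image_pct, video_pct):
    """Approximate per-modality sample counts from a T/I/V mixture.
    Percentages are as reported in the table; counts are rounded."""
    assert abs(text_pct + image_pct + video_pct - 100) < 1e-6
    return {
        "text": round(total * text_pct / 100),
        "image": round(total * image_pct / 100),
        "video": round(total * video_pct / 100),
    }

# PT-Stage 1: 15.1M samples at 0% / 79% / 21%
stage1 = mixture_counts(15_100_000, 0, 79, 21)
```

For PT-Stage 1 this implies roughly 11.9M image samples and 3.2M video samples.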
- **Performance Consistency in the Real-time Variant**: Experimental results show that MOSS-Video-Preview-Realtime-SFT achieves near-lossless performance retention. Its performance remains highly consistent with the standard SFT version across MMBench, AI2D, and the majority of video benchmarks, even showing superior results on specific temporal understanding tasks such as TempCompass. This confirms the model's ability to balance real-time response requirements with high-precision perception in real-world deployment scenarios.
- **Visual Logical Reasoning**: In the Multimodal Reasoning category, the MOSS series demonstrates robust logical deduction performance. Notably, on the VisuLogic benchmark, both MOSS variants (28.60 / 28.70) outperform LLaVA-OneVision (27.00) and Qwen2.5-VL (25.90). This reflects the models' superior stability on logically challenging tasks such as visual patterns and spatial reasoning.
- **Fine-grained Video Insight**: The MOSS series shows a significant competitive edge in fine-grained action logic and spatio-temporal perception within the video understanding domain. On the Video-Holmes benchmark, the MOSS series achieved high scores of 39.30 / 39.50, while Qwen2.5-VL scored 33.00. These results indicate that, compared to other open-source models of the same scale, MOSS possesses a deeper perceptual capacity for capturing subtle motions and complex spatio-temporal dynamics in long video sequences.
The core optimization of MOSS-Video-Preview lies in bridging the gap between high-quality reasoning and low-latency real-time streaming, as further evidenced by the speed measurement below.
We measure streaming inference speed of MOSS-Video-Preview against another strong open-source video model under the same hardware and decoding configuration (this is a single-setup speed comparison, not a standardized benchmark suite).
- Hardware: NVIDIA H200 (single GPU)
- Video sampling: 256 extracted frames
- Input video:
  - Path: `data/example_video.mp4`
  - Resolution: 1920×1080
  - Duration: 97.56 s
  - Bitrate: ~2223.33 kbps
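For reference, extracting 256 frames from a 97.56 s clip can be done by picking uniformly spaced timestamps. The snippet below is a generic sketch of uniform sampling, not necessarily the exact strategy used by the inference scripts:

```python
def uniform_timestamps(duration_s, n_frames):
    """Pick n_frames timestamps evenly spread over the clip,
    each centered within an equal-length segment, so the first and
    last frames are not pinned to the clip boundaries."""
    step = duration_s / n_frames
    return [step * (i + 0.5) for i in range(n_frames)]

# 256 frames over the 97.56 s example video -> one frame every ~0.38 s
ts = uniform_timestamps(97.56, 256)
```

At this setting the effective sampling rate is about 2.6 frames per second.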
Speed comparison (higher TPS and lower latency are better):
| Model | Frames | Parameters | Avg TTFT (s) | Avg TPS (tokens/s) | Avg Total Latency (s) | P95 TTFT (s) |
|---|---|---|---|---|---|---|
| MOSS-Video-Preview | 256 | 11B | 1.9537 | 38.41 | 28.5104 | 1.9573 |
| Qwen2.5-VL-7B | 256 | 7B | 9.9402 | 14.26 | 52.7624 | 9.9564 |
Under this setting, MOSS-Video-Preview delivers ~5× faster TTFT, ~2.7× higher decoding throughput (TPS), and significantly lower end-to-end latency than Qwen2.5-VL-7B, making it highly suitable for real-time video understanding. Even with a larger parameter count (11B vs. 7B), it remains strongly competitive, indicating substantial headroom for further speedup at larger scales.
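Metrics like those in the table can be derived from raw token emission timestamps with a small helper. This is a generic measurement sketch, not the benchmarking harness used to produce the numbers above:

```python
def streaming_stats(start_time, token_times):
    """Compute TTFT and decoding TPS from token emission timestamps.
    TTFT = time of first token minus request start.
    TPS counts only the decode phase: tokens after the first,
    divided by the elapsed time between first and last token."""
    ttft = token_times[0] - start_time
    decode_elapsed = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_elapsed if decode_elapsed > 0 else float("inf")
    return ttft, tps
```

For example, a request started at t=0 whose tokens arrive at 2.0 s, 2.5 s, and 3.0 s has a TTFT of 2.0 s and a decode throughput of 2 tokens/s.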
```shell
conda create -n moss-video python=3.12.4 -y
conda activate moss-video
pip install -e .
```

This repository includes a small set of example files:

- Video: `data/example_video.mp4`
- Image: `data/example_image.jpg`
Tested setup: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1.
First, install PyTorch (select the appropriate build for your CUDA/CPU environment), then install FlashAttention2 and DeepSpeed:

```shell
# CUDA 12.1 (recommended)
pip install --index-url https://download.pytorch.org/whl/cu121 "torch==2.4.0"

# CPU-only (if CUDA is unavailable)
# pip install --index-url https://download.pytorch.org/whl/cpu "torch==2.4.0"

pip install -e ".[flash-attn,deepspeed]" --no-build-isolation
```

MOSS-Video-Preview supports offline and streaming inference modes.
Offline inference processes the entire video at once. This is suitable for batch processing or analyzing pre-recorded videos.
```shell
# Run offline inference demo
python -m inference.offline_infer \
  --checkpoint models/moss-video-sft \
  --video_path data/example_video.mp4 \
  --prompt "Describe the video." \
  --max_new_tokens 512
```

This mode runs offline (non-streaming) generation but must use a Real-Time SFT checkpoint (the same type as used for streaming inference). It is not compatible with base or plain SFT checkpoints.
```shell
# Run Real-Time SFT offline inference demo
python -m inference.realtime_offline_infer \
  --checkpoint models/moss-video-realtime-sft \
  --video_path data/example_video.mp4 \
  --prompt "Describe the video." \
  --max_new_tokens 512
```

Streaming inference processes video frames in real time as they are received. This is ideal for live streams or low-latency applications.
```shell
# Run streaming inference demo
python -m inference.realtime_streaming_infer \
  --checkpoint models/moss-video-realtime-sft \
  --video_path data/example_video.mp4 \
  --prompt "Describe the video." \
  --max_new_tokens 512
```

The streaming inference uses a unified pipeline where frames are fed into an `image_queue` and tokens are consumed from a `token_queue` in real time.
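The queue-based pipeline can be sketched as follows. Here `frame_to_token` is a hypothetical stand-in for the model's forward pass; the real pipeline additionally involves batching, KV-cache updates, and GPU execution.

```python
import queue
import threading

def streaming_demo(frames, frame_to_token):
    """Minimal producer/consumer sketch: frames go into image_queue,
    a worker thread emits tokens into token_queue, and the caller
    drains token_queue as tokens become available."""
    image_queue, token_queue = queue.Queue(), queue.Queue()

    def worker():
        while True:
            frame = image_queue.get()
            if frame is None:               # end-of-stream sentinel
                token_queue.put(None)
                return
            token_queue.put(frame_to_token(frame))

    threading.Thread(target=worker, daemon=True).start()

    for f in frames:                        # producer side: inject frames
        image_queue.put(f)
    image_queue.put(None)

    tokens = []                             # consumer side: drain tokens
    while (tok := token_queue.get()) is not None:
        tokens.append(tok)
    return tokens
```

Because the producer and consumer run on separate threads, frame injection never blocks on token generation, mirroring the non-blocking input described in the Real-Time Inference Process section.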
MOSS-Video-Preview supports a variety of training modes via LlamaFactory integration.
| Mode | VRAM (GB/GPU) | Hardware | Config File |
|---|---|---|---|
| PT (Pretrain) | ~80 | H100/H200 | `mllm_pretrain_1node.yaml` |
| SFT (Offline) | ~80 | H100/H200 | `mllm_offline_sft_1node.yaml` |
| SFT (Real-time) | ~80 | H100/H200 | `mllm_realtime_sft_1node.yaml` |
To start training, use the following command:
```shell
FORCE_TORCHRUN=1 llamafactory-cli train train_config/mllm_pretrain_1node.yaml
```

You can choose different configuration files from the `train_config` directory based on the training stage:

- pretrain: `train_config/mllm_pretrain_1node.yaml`
- sft-offline: `train_config/mllm_offline_sft_1node.yaml`
- sft-realtime: `train_config/mllm_realtime_sft_1node.yaml`
| Model | HuggingFace Link | ModelScope Link |
|---|---|---|
| moss-video-preview-base | HuggingFace | ModelScope |
| moss-video-preview-sft | HuggingFace | ModelScope |
| moss-video-preview-realtime-sft | HuggingFace | ModelScope |
- Performance Benchmarking: While the real-time comprehension capability has been successfully validated, a performance gap remains compared to top-tier semi-open-source models such as Qwen2.5-VL. Closing this gap and aligning with SOTA benchmarks is a primary focus for our future iterations.
- Scalable Distributed Training: The current training pipeline is primarily optimized for architectural validation. We plan to integrate the Megatron-LM framework to leverage advanced 3D parallelism (Tensor, Pipeline, and Data Parallelism) for large-scale pre-training and fine-tuning. In the next major release, we will officially open-source the complete training codebase, model weights, and configurations.
- Data Scaling & Diversity: Our current training relies heavily on public datasets. Future updates will focus on expanding the scale and diversity of our multimodal data to enhance the model's generalizability and overall robustness across a wider range of real-world scenarios.
- Unified Position Encoding
- NPU/CUDA Flash Attention 2 Integration
- Streaming Vision Encoder
- LlamaFactory Training Support
- Technical Report
- Open-source MOSS-VL
```bibtex
@misc{moss_video_2026,
  title        = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/OpenMOSS/MOSS-Video-Preview}},
  note         = {GitHub repository}
}
```

- Core Contributors: Pengyu Wang*, Chenkun Tan, Shaojun Zhou, Wei Huang, Qirui Zhou, Zhan Huang, Zhen Ye, Jijun Cheng
- Contributors: Xiaomeng Qian, Yanxin Chen, Xingyang He, Huazheng Zeng, Chenghao Wang, Hongkai Wang, Pengfei Wang, Chenghao Liu, Shanqing Gao, Yixian Tian, Xinghao Wang, Botian Jiang, Xipeng Qiu†

Legend: * Project Leader; † Corresponding Author
We extend our gratitude to the contributors of LlamaFactory, Transformers, and the OpenMOSS community for their invaluable support.

