MOSS-Video-Preview is a multimodal vision foundation model specifically engineered for real-time video understanding. Built upon the Llama-3.2-Vision architecture, it comprehensively extends the model's native video processing capabilities, delivering state-of-the-art real-time multimodal reasoning performance.
> [!IMPORTANT]
> **Project Note:**
> At this stage, this project serves as an exploratory endeavor, leveraging high-quality open-source datasets to validate the potential of the Cross-Attention architecture for native real-time video understanding. This is only the beginning; we are committed to a comprehensive scaling roadmap across three dimensions: Data Scaling, Parameter Scaling, and Context Scaling, with the goal of building more robust and general-purpose video intelligence.
>
> As we strive to build more robust and general-purpose video intelligence, we warmly welcome experts in Representation Learning, Model Compression, and Inference Acceleration to join our journey. Whether you are optimizing inference latency or exploring efficient architectures, we invite you to experiment and innovate on top of our framework. Let's push the boundaries of video intelligence and advance the open-source community together!
- **Image-Video Cross-Attention Architecture**: Transcending the limitations of conventional architectures, MOSS-Video-Preview leverages a native Cross-Attention mechanism to provide unified image-video understanding. This approach deeply decouples visual and linguistic features, facilitating seamless and continuous analysis of ultra-long temporal sequences.
- **Millisecond-Level Interaction & Dynamic Self-Correction**: The system supports seamless transitions between "Silence" and "Speak" modes. With enhanced contextual awareness, the model allows real-time interruptions to adjust or refine responses dynamically as video scenes evolve, delivering a truly responsive, full-duplex user experience.
- **Extreme Inference Performance & Kernel Optimization**: Leveraging deeply optimized Cross-Attention kernels and Flash Attention 2 acceleration on both CUDA and NPU platforms, MOSS-Video-Preview is specifically engineered for long-form video processing. It achieves ultra-low latency while significantly reducing memory overhead.
- **Fine-grained Data Synthesis Pipeline**: We have engineered a sophisticated data synthesis pipeline for real-time video understanding, powered by state-of-the-art multimodal LLMs. We are committed to open-sourcing these datasets in the near future to support the research community and collectively advance the frontier of real-time video perception.
- News
- Model Architecture
- Real-Time Inference Process
- Demo
- Training Stages & Data Composition
- Evaluation Results
- Streaming Inference Speed (Single-Setup Measurement)
- Quick Start
- Training & Fine-tuning
- Model Download
- Limitations & Future Outlook
- TODO List
- Citation
- Acknowledgement
- 2026/04/08: MOSS-VL is officially open-sourced! Released MOSS-VL-Base-0408 and MOSS-VL-Instruct-0408.
- 2026/03/04: MOSS-Video-Preview source code and architecture details released!
- 2025/10/18: Post-mortem on current issues; started the MOSS-VL project.
- 2025/10/08: Internal demo showcased within the lab and the school.
- 2025/09: moss-video-preview-realtime-sft ready.
- 2025/08: moss-video-preview-sft ready.
Built on a native Real-Time temporal architecture, MOSS-Video-Preview decouples visual perception and linguistic reasoning to minimize computational latency. This enables millisecond-level streaming performance, ensuring a highly responsive and fluid interactive experience for continuous video streams.
Figure 1: Overall architecture of MOSS-Video-Preview.
The core strength of MOSS-Video-Preview lies in its native real-time streaming capability, enabling continuous, low-latency processing of live video feeds.
Figure 2: Real-Time inference pipeline.
- **Asynchronous Real-Time Input**: Video frames are continuously injected at a stable frame rate for high-frequency real-time perception. The input process is non-blocking and fully decoupled from the text generation loop, ensuring uninterrupted visual tracking.
- **Long-range State Persistence**: Leveraging a Cross-Attention KV Cache and Temporal Positional Encoding, the model maintains robust contextual dependencies across continuous frames, ensuring coherent temporal understanding over extended sequences.
- **Ultra-Low Latency Streaming Response**: The model supports autoregressive generation simultaneously with the incoming video stream. By eliminating the need for full-clip buffering, it achieves "on-the-fly" reasoning and interaction with minimal end-to-end latency.
- **Cross-Modal Projector**: Featuring the proprietary `VideoMllamaTextCrossAttention` mechanism, this component utilizes bidirectional cross-attention to achieve highly efficient fusion and semantic alignment between temporal visual features and linguistic context.
- **Streaming Causal Decoding Module**: A specialized decoder for autoregressive generation based on dynamic visual inputs. It adapts dynamically, adjusting and refining generated content in real time based on the latest visual cues captured from the stream.
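To make the fusion step concrete, below is a toy, pure-Python sketch of single-head cross-attention in the spirit of the projector described above. It is illustrative only: the actual `VideoMllamaTextCrossAttention` module operates on learned projections with many heads, and its exact internals are not reproduced here.

```python
import math

def cross_attention(queries, keys, values):
    """Toy single-head cross-attention: text-side query vectors attend
    over visual key/value vectors (plain lists of floats).
    Returns one attended output vector per query."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Scaled dot-product scores against every visual key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

With one-hot values, the output is just the attention distribution itself, which makes the mechanism easy to inspect: a query aligned with the first key puts more weight on the first value.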
streaming_demo.mp4
video_demo.mp4
image_demo.mp4
MOSS-Video-Preview employs a three-stage progressive training strategy to evolve the model from basic modality alignment to complex real-time video reasoning.
| Stage | Core Objective | Trainable Parameters | Data Mixture (T / I / V) | Training Samples |
|---|---|---|---|---|
| PT-Stage 1 | Cross-modal Alignment | Vision Projector only | 0% / 79% / 21% | 15.1 M |
| PT-Stage 2 | Temporal & Long Video Perception | Full Parameters | 0% / 26% / 74% | 1.8 M |
| Offline SFT | Instruction Following & Reasoning | Full Parameters | 14% / 44% / 42% | 8.6 M |
| Real-Time SFT | Real-Time understanding and reasoning | Full Parameters | 11% / 29% / 60% | 836 K |
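As a sanity check on the table above, the per-modality sample counts implied by a T / I / V mixture follow from simple arithmetic. The helper below is a hypothetical illustration, not part of the training code:

```python
def mixture_counts(total, text_pct, image_pct, video_pct):
    """Approximate per-modality sample counts from a T/I/V mixture.
    Percentages are as reported in the table; counts are rounded."""
    assert abs(text_pct + image_pct + video_pct - 100) < 1e-6
    return {
        "text": round(total * text_pct / 100),
        "image": round(total * image_pct / 100),
        "video": round(total * video_pct / 100),
    }

# PT-Stage 1: 15.1M samples at 0% / 79% / 21%
stage1 = mixture_counts(15_100_000, 0, 79, 21)
```

For PT-Stage 1 this implies roughly 11.9M image samples and 3.2M video samples.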
- **Performance Consistency in the Real-time Variant**: Experimental results show that MOSS-Video-Preview-Realtime-SFT achieves near-lossless performance retention. Its performance remains highly consistent with the standard SFT version across MMBench, AI2D, and the majority of video benchmarks, even showing superior results on specific temporal understanding tasks such as TempCompass. This confirms the model's ability to balance real-time response requirements with high-precision perception in real-world deployment scenarios.
- **Visual Logical Reasoning**: In the Multimodal Reasoning category, the MOSS series demonstrates robust logical deduction performance. Notably, on the VisuLogic benchmark, both MOSS variants (28.60 / 28.70) outperform LLaVA-OneVision (27.00) and Qwen2.5-VL (25.90). This reflects the models' superior stability on logically challenging tasks such as visual patterns and spatial reasoning.
- **Fine-grained Video Insight**: The MOSS series shows a significant competitive edge in fine-grained action logic and spatio-temporal perception within the video understanding domain. On the Video-Holmes benchmark, the MOSS series achieved high scores of 39.30 / 39.50, while Qwen2.5-VL scored 33.00. These results indicate that, compared to other open-source models of the same scale, MOSS possesses a deeper perceptual capacity for capturing subtle motions and complex spatio-temporal dynamics in long video sequences.
The core optimization of MOSS-Video-Preview lies in bridging the gap between high-quality reasoning and low-latency real-time streaming, as further evidenced by the speed measurement below.
We measure streaming inference speed of MOSS-Video-Preview against another strong open-source video model under the same hardware and decoding configuration (this is a single-setup speed comparison, not a standardized benchmark suite).
- Hardware: NVIDIA H200 (single GPU)
- Video sampling: 256 extracted frames
- Input video:
  - Path: `data/example_video.mp4`
  - Resolution: 1920×1080
  - Duration: 97.56 s
  - Bitrate: ~2223.33 kbps
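For reference, extracting 256 frames from a 97.56 s clip can be done by picking uniformly spaced timestamps. The snippet below is a generic sketch of uniform sampling, not necessarily the exact strategy used by the inference scripts:

```python
def uniform_timestamps(duration_s, n_frames):
    """Pick n_frames timestamps evenly spread over the clip,
    each centered within an equal-length segment, so the first and
    last frames are not pinned to the clip boundaries."""
    step = duration_s / n_frames
    return [step * (i + 0.5) for i in range(n_frames)]

# 256 frames over the 97.56 s example video -> one frame every ~0.38 s
ts = uniform_timestamps(97.56, 256)
```

At this setting the effective sampling rate is about 2.6 frames per second.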
Speed comparison (higher TPS and lower latency are better):
| Model | Frames | Parameters | Avg TTFT (s) | Avg TPS (tokens/s) | Avg Total Latency (s) | P95 TTFT (s) |
|---|---|---|---|---|---|---|
| MOSS-Video-Preview | 256 | 11B | 1.9537 | 38.41 | 28.5104 | 1.9573 |
| Qwen2.5-VL-7B | 256 | 7B | 9.9402 | 14.26 | 52.7624 | 9.9564 |
Under this setting, MOSS-Video-Preview delivers ~5× faster TTFT, ~2.7× higher decoding throughput (TPS), and significantly lower end-to-end latency than Qwen2.5-VL-7B, making it highly suitable for real-time video understanding. Even with a larger parameter count (11B vs. 7B), it remains strongly competitive, indicating substantial headroom for further speedup at larger scales.
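Metrics like those in the table can be derived from raw token emission timestamps with a small helper. This is a generic measurement sketch, not the benchmarking harness used to produce the numbers above:

```python
def streaming_stats(start_time, token_times):
    """Compute TTFT and decoding TPS from token emission timestamps.
    TTFT = time of first token minus request start.
    TPS counts only the decode phase: tokens after the first,
    divided by the elapsed time between first and last token."""
    ttft = token_times[0] - start_time
    decode_elapsed = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_elapsed if decode_elapsed > 0 else float("inf")
    return ttft, tps
```

For example, a request started at t=0 whose tokens arrive at 2.0 s, 2.5 s, and 3.0 s has a TTFT of 2.0 s and a decode throughput of 2 tokens/s.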
```shell
conda create -n moss-video python=3.12.4 -y
conda activate moss-video
pip install -e .
```

This repository includes a small set of example files:

- Video: `data/example_video.mp4`
- Image: `data/example_image.jpg`
Tested setup: Python 3.12.4 + PyTorch 2.4.0 (CUDA 12.1) + DeepSpeed 0.16.1.
First, install PyTorch (select the appropriate build for your CUDA/CPU environment), then install FlashAttention2 and DeepSpeed:

```shell
# CUDA 12.1 (recommended)
pip install --index-url https://download.pytorch.org/whl/cu121 "torch==2.4.0"

# CPU-only (if CUDA is unavailable)
# pip install --index-url https://download.pytorch.org/whl/cpu "torch==2.4.0"

pip install -e ".[flash-attn,deepspeed]" --no-build-isolation
```

MOSS-Video-Preview supports offline and streaming inference modes.
Offline inference processes the entire video at once. This is suitable for batch processing or analyzing pre-recorded videos.
```shell
# Run offline inference demo
python -m inference.offline_infer \
  --checkpoint models/moss-video-sft \
  --video_path data/example_video.mp4 \
  --prompt "Describe the video." \
  --max_new_tokens 512
```

This mode runs offline (non-streaming) generation but must use a Real-Time SFT checkpoint (the same type as used for streaming inference). It is not compatible with base or plain SFT checkpoints.
```shell
# Run Real-Time SFT offline inference demo
python -m inference.realtime_offline_infer \
  --checkpoint models/moss-video-realtime-sft \
  --video_path data/example_video.mp4 \
  --prompt "Describe the video." \
  --max_new_tokens 512
```

Streaming inference processes video frames in real time as they are received. This is ideal for live streams or low-latency applications.
```shell
# Run streaming inference demo
python -m inference.realtime_streaming_infer \
  --checkpoint models/moss-video-realtime-sft \
  --video_path data/example_video.mp4 \
  --prompt "Describe the video." \
  --max_new_tokens 512
```

The streaming inference uses a unified pipeline where frames are fed into an `image_queue` and tokens are consumed from a `token_queue` in real time.
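The queue-based pipeline can be sketched as follows. Here `frame_to_token` is a hypothetical stand-in for the model's forward pass; the real pipeline additionally involves batching, KV-cache updates, and GPU execution.

```python
import queue
import threading

def streaming_demo(frames, frame_to_token):
    """Minimal producer/consumer sketch: frames go into image_queue,
    a worker thread emits tokens into token_queue, and the caller
    drains token_queue as tokens become available."""
    image_queue, token_queue = queue.Queue(), queue.Queue()

    def worker():
        while True:
            frame = image_queue.get()
            if frame is None:               # end-of-stream sentinel
                token_queue.put(None)
                return
            token_queue.put(frame_to_token(frame))

    threading.Thread(target=worker, daemon=True).start()

    for f in frames:                        # producer side: inject frames
        image_queue.put(f)
    image_queue.put(None)

    tokens = []                             # consumer side: drain tokens
    while (tok := token_queue.get()) is not None:
        tokens.append(tok)
    return tokens
```

Because the producer and consumer run on separate threads, frame injection never blocks on token generation, mirroring the non-blocking input described in the Real-Time Inference Process section.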
MOSS-Video-Preview supports a variety of training modes via LlamaFactory integration.
| Mode | VRAM (GB/GPU) | Hardware | Config File |
|---|---|---|---|
| PT (Pretrain) | ~80 | H100/H200 | `mllm_pretrain_1node.yaml` |
| SFT (Offline) | ~80 | H100/H200 | `mllm_offline_sft_1node.yaml` |
| SFT (Real-time) | ~80 | H100/H200 | `mllm_realtime_sft_1node.yaml` |
To start training, use the following command:
```shell
FORCE_TORCHRUN=1 llamafactory-cli train train_config/mllm_pretrain_1node.yaml
```

You can choose different configuration files from the `train_config` directory based on the training stage:

- pretrain: `train_config/mllm_pretrain_1node.yaml`
- sft-offline: `train_config/mllm_offline_sft_1node.yaml`
- sft-realtime: `train_config/mllm_realtime_sft_1node.yaml`
| Model | HuggingFace Link | ModelScope Link |
|---|---|---|
| moss-video-preview-base | HuggingFace | ModelScope |
| moss-video-preview-sft | HuggingFace | ModelScope |
| moss-video-preview-realtime-sft | HuggingFace | ModelScope |
- Performance Benchmarking: While the real-time comprehension capability has been successfully validated, a performance gap remains compared to top-tier semi-open-source models such as Qwen2.5-VL. Closing this gap and aligning with SOTA benchmarks is a primary focus for our future iterations.
- Scalable Distributed Training: The current training pipeline is primarily optimized for architectural validation. We plan to integrate the Megatron-LM framework to leverage advanced 3D parallelism (Tensor, Pipeline, and Data Parallelism) for large-scale pre-training and fine-tuning. In the next major release, we will officially open-source the complete training codebase, model weights, and configurations.
- Data Scaling & Diversity: Our current training relies heavily on public datasets. Future updates will focus on expanding the scale and diversity of our multimodal data to enhance the model's generalizability and overall robustness across a wider range of real-world scenarios.
- Unified Position Encoding
- NPU/CUDA Flash Attention 2 Integration
- Streaming Vision Encoder
- LlamaFactory Training Support
- Technical Report
- Open-source MOSS-VL
```bibtex
@misc{moss_video_2026,
  title        = {{MOSS-Video-Preview: Next-Generation Real-Time Video Understanding}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/OpenMOSS/MOSS-Video-Preview}},
  note         = {GitHub repository}
}
```

- Core Contributors: Pengyu Wang*, Chenkun Tan, Shaojun Zhou, Wei Huang, Qirui Zhou, Zhan Huang, Zhen Ye, Jijun Cheng
- Contributors: Xiaomeng Qian, Yanxin Chen, Xingyang He, Huazheng Zeng, Chenghao Wang, Hongkai Wang, Pengfei Wang, Chenghao Liu, Shanqing Gao, Yixian Tian, Xinghao Wang, Botian Jiang, Xipeng Qiu†

Legend: * Project Leader; † Corresponding Author
We extend our gratitude to the contributors of LlamaFactory, Transformers, and the OpenMOSS community for their invaluable support.

