Abstract
Motif-Video 2B achieves high text-to-video generation quality using a specialized architecture with shared cross-attention and a three-part backbone, together with efficient training methods, while requiring significantly fewer parameters and less training data than larger models.
Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and fewer than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76%, surpassing Wan2.1 14B while using 7× fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or even close the quality gap typically associated with much larger video models.
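The abstract describes the architecture only at a high level, so below is a minimal PyTorch sketch of one plausible reading of Shared Cross-Attention: a single text-to-video cross-attention module whose parameters are tied across all backbone blocks, so text conditioning is re-injected at every depth even as the video token sequence grows long. All class names, dimensions, head counts, and the exact tying scheme are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch (assumed, not the paper's code): one cross-attention module
# with tied parameters, reused by every backbone block.
import torch
import torch.nn as nn


class SharedCrossAttention(nn.Module):
    """Text -> video cross-attention; a single instance is shared by all blocks."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, text_tokens):
        # Queries come from video tokens; keys/values come from the prompt.
        out, _ = self.attn(self.norm(video_tokens), text_tokens, text_tokens)
        return video_tokens + out  # residual injection of the text signal


class Block(nn.Module):
    """One backbone block: per-block self-attention, shared cross-attention."""

    def __init__(self, dim: int, shared_xattn: SharedCrossAttention):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.xattn = shared_xattn  # same object => parameters tied across blocks

    def forward(self, x, text):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = self.xattn(x, text)
        return x + self.mlp(self.norm2(x))


dim = 256
shared = SharedCrossAttention(dim)
# Toy stand-in for the three-part backbone: every block, whatever its role
# (early fusion, joint learning, refinement), queries the same shared module.
blocks = nn.ModuleList(Block(dim, shared) for _ in range(6))

video = torch.randn(1, 1024, dim)  # a long video token sequence
text = torch.randn(1, 77, dim)     # encoded prompt tokens
for blk in blocks:
    video = blk(video, text)
print(video.shape)  # torch.Size([1, 1024, 256])
```

Under this reading, tying the cross-attention weights keeps the text pathway's parameter count constant as depth grows, which would be one way architectural specialization saves capacity relative to per-block cross-attention in a 2B-parameter budget.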
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Nucleus-Image: Sparse MoE for Image Generation (2026)
- SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model (2026)
- TrajTok: Learning Trajectory Tokens enables better Video Understanding (2026)
- Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model (2026)
- T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability (2026)
- PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference (2026)
- EFlow: Fast Few-Step Video Generator Training from Scratch via Efficient Solution Flow (2026)
Get this paper in your agent:
hf papers read 2604.16503
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash