Lance: 3B Unified Multimodal Model for Understanding, Generation, and Editing (314 HF upvotes)

ByteDance Research


Lance is a native unified multimodal model with 3B active parameters, trained from scratch, that supports image and video understanding, generation, and editing. It uses a dual-stream mixture-of-experts architecture over shared interleaved multimodal sequences, with modality-aware rotary positional encoding. The paper reports that it substantially outperforms existing open-source unified models on image and video generation benchmarks while retaining strong comprehension.
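To make the phrase "dual-stream mixture-of-experts over a shared interleaved sequence" concrete, here is a toy sketch of one plausible reading: every token takes part in a shared mixing step, then is hard-routed to a per-stream expert. This is an illustrative assumption, not the paper's implementation. The `Token` class, the "understanding" and "generation" stream labels, the scalar hidden states, and both expert functions are hypothetical stand-ins, and positional encoding is omitted entirely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Token:
    stream: str    # hypothetical label: "understanding" or "generation"
    hidden: float  # scalar stand-in for a hidden-state vector


def understanding_expert(h: float) -> float:
    # Toy stand-in for the understanding-stream feed-forward expert.
    return 2.0 * h


def generation_expert(h: float) -> float:
    # Toy stand-in for the generation-stream feed-forward expert.
    return h + 1.0


EXPERTS: Dict[str, Callable[[float], float]] = {
    "understanding": understanding_expert,
    "generation": generation_expert,
}


def shared_mixing(seq: List[Token]) -> List[float]:
    # Toy stand-in for shared attention: every token sees the mean of the
    # whole interleaved sequence, regardless of which stream it belongs to.
    mean = sum(t.hidden for t in seq) / len(seq)
    return [t.hidden + mean for t in seq]


def dual_stream_layer(seq: List[Token]) -> List[float]:
    mixed = shared_mixing(seq)
    # Hard routing (an assumption): the stream label selects the expert.
    return [EXPERTS[t.stream](h) for t, h in zip(seq, mixed)]


# One interleaved sequence containing tokens from both streams.
sequence = [
    Token("understanding", 1.0),  # e.g. a prompt text token
    Token("generation", 3.0),     # e.g. an image latent being generated
    Token("understanding", 2.0),
]
outputs = dual_stream_layer(sequence)
```

The point of the sketch is only the separation of concerns: context is shared across the whole sequence, while the per-token transformation depends on the stream. A real model would use vector hidden states, attention, and learned expert weights.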

Why it matters

Lance suggests that a lean unified model with 3B active parameters, trained with a careful multi-task recipe, can rival much larger single-task specialists across the full understanding-generation spectrum.

Importance: 3/5

314 HF upvotes; reported state-of-the-art unified understanding and generation at 3B active parameters from ByteDance, challenging both specialist models and larger unified models.

multimodal moe video-generation image-generation architecture

Sources

official Lance: Unified Multimodal Modeling by Multi-Task Synergy — arXiv:2605.18678

media Lance — HuggingFace Daily Papers (314 upvotes)