Kwai Keye-VL-2.0: Open-Source 30B MoE Multimodal Model with 256K Context for Long Video

Kwai


Kwai released Keye-VL-2.0, an open-source 30B-parameter Mixture-of-Experts (MoE) multimodal model with 3B active parameters. Its key advance is adapting a sparse attention mechanism derived from DeepSeek to support a lossless 256K-token context for hour-long video understanding. A novel training technique, Cross-Modal Multi-Teacher On-Policy Distillation, prevents catastrophic forgetting across tasks. The model also supports multimodal agentic workflows, including code execution, tool use, and web search.
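As a rough sanity check on the figures quoted above, the short sketch below works out what share of the model is active and how many context tokens are available per second of an hour-long video. It is arithmetic on the quoted numbers only, not a description of how Keye-VL-2.0 tokenizes video. The function names are hypothetical, and two readings are assumed: that "256K" means 262,144 tokens and that "hour-long" means exactly 3,600 seconds.

```python
# Back-of-the-envelope budget using only the figures quoted in the summary.
TOTAL_PARAMS = 30e9           # 30B total parameters (from the summary)
ACTIVE_PARAMS = 3e9           # 3B active parameters (from the summary)
CONTEXT_TOKENS = 256 * 1024   # assumption: "256K" means 262,144 tokens
VIDEO_SECONDS = 60 * 60       # assumption: "hour-long" means exactly 3,600 s


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters that are active in a sparse MoE."""
    return active / total


def tokens_per_second(context_tokens: int, seconds: int) -> float:
    """Upper bound on tokens per second of video if the entire context
    window is spent on the video (no prompt, no generated answer)."""
    return context_tokens / seconds


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    budget = tokens_per_second(CONTEXT_TOKENS, VIDEO_SECONDS)
    print(f"active fraction: {frac:.0%}")
    print(f"token budget: {budget:.1f} tokens/s")
```

Under these assumptions, about 10% of the parameters are active, and the window leaves at most roughly 72.8 tokens per second of video. This is an upper bound, since any prompt text and generated output share the same context.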

Why it matters

With 785 upvotes on Hugging Face, it was the top paper of June 10. It delivers state-of-the-art long-video comprehension on Video-MME-v2, LongVideoBench, and TimeLens at a competitive parameter budget, with full open weights and native agent capabilities. It raises the bar for open multimodal models.

Importance: 4/5

Top Hugging Face Daily Paper on June 10 (785 upvotes, +1 bump); state-of-the-art long-video multimodal performance at an efficient MoE scale; full open weights with native agent capabilities.

Sources