Kwai Keye-VL-2.0: Open-Source 30B MoE Multimodal Model with 256K Context for Long Video
Kwai
Kwai released Keye-VL-2.0, an open-source 30B Mixture-of-Experts multimodal model with 3B active parameters. Its key advance is adapting sparse attention (derived from DeepSeek) to support a lossless 256K-token context for hour-long video understanding. A novel training technique, Cross-Modal Multi-Teacher On-Policy Distillation, prevents catastrophic forgetting across tasks. The model also supports multimodal agentic workflows, including code execution, tool use, and web search.
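To put the headline numbers in proportion, here is a back-of-the-envelope sketch in Python. It only does arithmetic on the figures quoted above. The function names are illustrative, reading "256K" as 256 × 1024 tokens is an assumption, and the frame rates are hypothetical rather than the model's actual sampling strategy.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of parameters used per token in a Mixture-of-Experts model."""
    return active_params_b / total_params_b


def tokens_per_frame(context_tokens: int, video_seconds: int, fps: float) -> float:
    """Average token budget per sampled frame if the whole context held video."""
    frames = video_seconds * fps
    return context_tokens / frames


# Assumption: "256K" is read as 256 * 1024 = 262,144 tokens.
CONTEXT_TOKENS = 256 * 1024
ONE_HOUR_SECONDS = 60 * 60

if __name__ == "__main__":
    # 3B active out of 30B total parameters.
    print(f"active fraction: {active_fraction(30, 3):.0%}")

    # Hypothetical sampling rates, not taken from the release.
    for fps in (0.5, 1.0, 2.0):
        budget = tokens_per_frame(CONTEXT_TOKENS, ONE_HOUR_SECONDS, fps)
        print(f"{fps} fps -> about {budget:.0f} tokens per frame")
```

Under these assumptions, roughly a tenth of the weights are active per token, and an hour of video sampled at one frame per second leaves about 73 tokens per frame before any text prompt is counted, which suggests why a long context matters for this use case.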
Why it matters
785 upvotes on Hugging Face, making it the top paper of June 10. It delivers state-of-the-art long-video comprehension (Video-MME-v2, LongVideoBench, TimeLens) at a competitive parameter budget, with full open weights and native agent capabilities. It raises the bar for open multimodal models.
Importance: 4/5
Top HF Daily Paper for June 10 (785 upvotes, +1 bump); SOTA long-video multimodal understanding at efficient MoE scale; full open weights with native agent capabilities.