multimodal — AI Digest

4 Jun | NVIDIA Releases Cosmos 3: Open Omnimodal World Foundation Model for Physical AI | NVIDIA | research
30 Apr | GLM-5V-Turbo: a natively multimodal foundation model for agents | Z.ai | research
13 May | Thinking Machines Lab Unveils TML-Interaction-Small: 276B MoE Real-Time Multimodal Model | Thinking Machines Lab | models-llm
13 May | SenseNova-U1: Open-Source Unified Multimodal Understanding and Generation via NEO-unify | SenseTime | research
15 May | MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Images | Technion | research
19 May | Google I/O 2026: Gemini 4, Jules V2, Firebase Studio GA, Android XR, and Aluminium OS | Google DeepMind | models-llm
20 May | Google Introduces Gemini Omni: Any-to-Any Video Generation in Consumer Products | Google DeepMind | video
2 Jun | Alibaba Launches Qwen3.7-Plus: Multimodal Agent with Vision, Reasoning, and Autonomous Execution | Alibaba / Qwen | models-llm
2 Jun | MiniMax Releases M3: Open-Weight Frontier Model with 1M-Token Context and MSA Architecture | MiniMax | models-llm
4 Jun | Google DeepMind Releases Gemma 4 12B: Encoder-Free Multimodal Model That Runs on a 16 GB Laptop | Google DeepMind | models-llm
10 Jun | MiniMax M3 Open Weights Released: 1M Context, MoE, Frontier Coding | MiniMax | models-llm
2 May | Eywa: heterogeneous collaboration framework between LLM agents and scientific foundation models | University of Illinois at Urbana-Champaign | research
3 May | MiniCPM-o 4.5: Real-Time Full-Duplex Omni-Modal AI on Edge Devices | OpenBMB / Tsinghua University | research
5 May | AI2 Open-Sources MolmoAct2: Robotics VLA That Claims to Beat GPT-5 on Embodied Reasoning | AI2 | research
5 May | UniVidX: One Diffusion Backbone for RGB, Intrinsic Maps, and RGBA Video Generation | research
9 May | ByteDance Launches Doubao-Seed-2.0-lite: First Omni-Modal Model in Seed Series | ByteDance | models-llm
12 May | Qwen-Image-2.0: Unified Image Generation and Editing at 2K Resolution, Top-1 on AI Arena | Alibaba | research
13 May | Google DeepMind Unveils Magic Pointer: AI-Aware Mouse Cursor for Chrome and Googlebook | Google DeepMind | tools
20 May | Lance: 3B Unified Multimodal Model for Understanding, Generation, and Editing (314 HF upvotes) | ByteDance Research | research
19 May | LongLive-2.0: NVFP4 Parallel Infrastructure for Long Video Generation (NVIDIA, 1,220 HF upvotes) | NVIDIA | research
11 Jun | Kwai Keye-VL-2.0: Open-Source 30B MoE Multimodal Model with 256K Context for Long Video | Kwai | research
14 Jun | Moonshot AI Releases Kimi K2.7-Code: 1T-Parameter Open-Weight Coding Model with Vision | Moonshot AI | models-llm
17 Jun | JoyAI-VL-Interaction: Open-Source 8B Real-Time VLM with Autonomous Turn-Taking | JD.com | research
7 May | RLDX-1: Multi-Stream Action Transformer Achieves 86.8% on ALLEX Humanoid Tasks | RLWRLD | research
9 May | OpenSearch-VL: Open Recipe for Training Frontier Multimodal Search Agents | Tencent Hunyuan | research
16 May | SANA-WM: Minute-Scale 720p World Modeling on a Single GPU | NVIDIA | research
16 May | MemLens: Benchmark for Multimodal Long-Term Memory in Vision-Language Models | NVIDIA | research
19 May | CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence (178 HF upvotes) | Peking University / Shanghai Artificial Intelligence Laboratory | research
19 May | PhysBrain 1.0: Human Egocentric Video as Robot Training Data for VLA Models (133 HF upvotes) | DeepCybo | research
19 May | MMSkills: Reusable Multimodal Skills for General Visual Agents (105 HF upvotes) | Shanghai Jiao Tong University | research
6 Jun | Audio Interaction Model: Unified Streaming Framework Combining Offline and Real-Time Audio Instruction Following | research
11 Jun | Z-Reward: Score Distributions Instead of Scalar Rewards for Image Generation RLHF | Alibaba | research
14 Jun | vLLM Adds Day-0 Support for MiniMax M3 Open Weights with 1M-Context Sparse Attention | MiniMax | tools
14 Jun | InterleaveThinker: RL Planner+Critic Pipeline for Interleaved Text-and-Image Generation | CUHK Multimedia Lab | research
2 May | CoPD: co-evolving policy distillation for unified multi-capability models | research
5 May | Odysseus: Training VLMs for 100+ Turn Interactive Decision-Making via RL | Princeton University | research
13 May | World Action Models: First Systematic Survey of Embodied Foundation Models Unifying World Modeling and Action | OpenMOSS | research
20 May | Google Project Genie World Model Now Simulates Real Places Using Street View | Google DeepMind | research
12 Jun | InterleaveThinker: RL Framework for Agentic Text-and-Image Interleaved Generation | research
12 Jun | Astra: RL-Trained VLM Queries World Simulator for Spatial Reasoning | research
29 Apr | DeepSeek launches image recognition mode in a gray-scale test | DeepSeek | models-llm
12 May | Google's Gemini 'Omni' Video Model Surfaces in Early Demos Ahead of I/O 2026 | Google DeepMind | video
16 May | llama.cpp b9161/b9169: Codex CLI Compatibility and Qwen3A Multimodal Support | ggml-org | tools
8 Jun | VideoKR: 315K-Example Training Corpus for Knowledge- and Reasoning-Intensive Video Understanding | Yale University | research
9 Jun | Echo-Memory: Controlled Study of Memory Mechanisms in Action-Conditioned Video World Models | Microsoft Research | research
10 Jun | SCAIL-2: End-to-End Character Animation via In-Context Conditioning | Tsinghua University | research
19 Jun | llama.cpp b9716 Builds: InternVL Multimodal Batching, CUDA col2im, and Nginx SSE Fix | tools
19 Jun | StylisticBias: 15 Visual Attributes Account for 80% of Social Bias in Multimodal LLMs | research
19 Jun | Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agent Loops | research
28 Apr | Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond | HKUST/NUS/Oxford/NTU | research