-
NVIDIA Releases Cosmos 3: Open Omnimodal World Foundation Model for Physical AI
NVIDIA
research
-
GLM-5V-Turbo: A Natively Multimodal Foundation Model for Agents
Z.ai
research
-
Thinking Machines Lab Unveils TML-Interaction-Small: 276B MoE Real-Time Multimodal Model
Thinking Machines Lab
models-llm
-
SenseNova-U1: Open-Source Unified Multimodal Understanding and Generation via NEO-unify
SenseTime
research
-
MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Images
Technion
research
-
Google I/O 2026: Gemini 4, Jules V2, Firebase Studio GA, Android XR, and Aluminium OS
Google DeepMind
models-llm
-
Google Introduces Gemini Omni: Any-to-Any Video Generation in Consumer Products
Google DeepMind
video
-
Alibaba Launches Qwen3.7-Plus: Multimodal Agent with Vision, Reasoning, and Autonomous Execution
Alibaba / Qwen
models-llm
-
MiniMax Releases M3: Open-Weight Frontier Model with 1M-Token Context and MSA Architecture
MiniMax
models-llm
-
Google DeepMind Releases Gemma 4 12B: Encoder-Free Multimodal Model That Runs on a 16 GB Laptop
Google DeepMind
models-llm
-
MiniMax M3 Open Weights Released: 1M Context, MoE, Frontier Coding
MiniMax
models-llm
-
Eywa: Heterogeneous Collaboration Framework Between LLM Agents and Scientific Foundation Models
University of Illinois at Urbana-Champaign
research
-
MiniCPM-o 4.5: Real-Time Full-Duplex Omni-Modal AI on Edge Devices
OpenBMB / Tsinghua University
research
-
AI2 Open-Sources MolmoAct2: Robotics VLA That Claims to Beat GPT-5 on Embodied Reasoning
AI2
research
-
UniVidX: One Diffusion Backbone for RGB, Intrinsic Maps, and RGBA Video Generation
research
-
ByteDance Launches Doubao-Seed-2.0-lite: First Omni-Modal Model in Seed Series
ByteDance
models-llm
-
Qwen-Image-2.0: Unified Image Generation and Editing at 2K Resolution, Top-1 on AI Arena
Alibaba
research
-
Google DeepMind Unveils Magic Pointer: AI-Aware Mouse Cursor for Chrome and Googlebook
Google DeepMind
tools
-
Lance: 3B Unified Multimodal Model for Understanding, Generation, and Editing (314 HF upvotes)
ByteDance Research
research
-
LongLive-2.0: NVFP4 Parallel Infrastructure for Long Video Generation (1,220 HF upvotes)
NVIDIA
research
-
Kwai Keye-VL-2.0: Open-Source 30B MoE Multimodal Model with 256K Context for Long Video
Kwai
research
-
Moonshot AI Releases Kimi K2.7-Code: 1T-Parameter Open-Weight Coding Model with Vision
Moonshot AI
models-llm
-
JoyAI-VL-Interaction: Open-Source 8B Real-Time VLM with Autonomous Turn-Taking
JD.com
research
-
RLDX-1: Multi-Stream Action Transformer Achieves 86.8% on ALLEX Humanoid Tasks
RLWRLD
research
-
OpenSearch-VL: Open Recipe for Training Frontier Multimodal Search Agents
Tencent Hunyuan
research
-
SANA-WM: Minute-Scale 720p World Modeling on a Single GPU
NVIDIA
research
-
MemLens: Benchmark for Multimodal Long-Term Memory in Vision-Language Models
NVIDIA
research
-
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence (178 HF upvotes)
Peking University / Shanghai Artificial Intelligence Laboratory
research
-
PhysBrain 1.0: Human Egocentric Video as Robot Training Data for VLA Models (133 HF upvotes)
DeepCybo
research
-
MMSkills: Reusable Multimodal Skills for General Visual Agents (105 HF upvotes)
Shanghai Jiao Tong University
research
-
Audio Interaction Model: Unified Streaming Framework Combining Offline and Real-Time Audio Instruction Following
research
-
Z-Reward: Score Distributions Instead of Scalar Rewards for Image Generation RLHF
Alibaba
research
-
vLLM Adds Day-0 Support for MiniMax M3 Open Weights with 1M-Context Sparse Attention
MiniMax
tools
-
InterleaveThinker: RL Planner+Critic Pipeline for Interleaved Text-and-Image Generation
CUHK Multimedia Lab
research
-
CoPD: Co-Evolving Policy Distillation for Unified Multi-Capability Models
research
-
Odysseus: Training VLMs for 100+ Turn Interactive Decision-Making via RL
Princeton University
research
-
World Action Models: First Systematic Survey of Embodied Foundation Models Unifying World Modeling and Action
OpenMOSS
research
-
Google Project Genie World Model Now Simulates Real Places Using Street View
Google DeepMind
research
-
InterleaveThinker: RL Framework for Agentic Text-and-Image Interleaved Generation
research
-
Astra: RL-Trained VLM Queries World Simulator for Spatial Reasoning
research
-
DeepSeek Launches Image Recognition Mode in a Gray-Scale (Limited-Rollout) Test
DeepSeek
models-llm
-
Google's Gemini 'Omni' Video Model Surfaces in Early Demos Ahead of I/O 2026
Google DeepMind
video
-
llama.cpp b9161/b9169: Codex CLI Compatibility and Qwen3A Multimodal Support
ggml-org
tools
-
VideoKR: 315K-Example Training Corpus for Knowledge- and Reasoning-Intensive Video Understanding
Yale University
research
-
Echo-Memory: Controlled Study of Memory Mechanisms in Action-Conditioned Video World Models
Microsoft Research
research
-
SCAIL-2: End-to-End Character Animation via In-Context Conditioning
Tsinghua University
research
-
llama.cpp b9716 Builds: InternVL Multimodal Batching, CUDA col2im, and Nginx SSE Fix
tools
-
StylisticBias: 15 Visual Attributes Account for 80% of Social Bias in Multimodal LLMs
research
-
Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agent Loops
research
-
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
HKUST / NUS / Oxford / NTU
research