Wan-Streamer v0.1: End-to-End Real-Time Interactive Foundation Model Under 550ms Latency

Wan-AI

Research | official + media | 2 sources | ~1 min read

A unified foundation model for real-time multimodal interaction that handles language, audio, and video in a single Transformer with block-causal attention. Unlike pipeline systems that chain separate ASR, reasoning, and TTS modules, Wan-Streamer jointly learns perception, reasoning, and generation, achieving ~200ms model-side latency and 550ms total interaction latency, with streaming units as short as 160ms (4 video frames at 25 fps). It currently runs at 192p resolution as a proof of concept.
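To make the two technical terms in the summary concrete, here is a minimal sketch of a generic block-causal attention mask, together with the frame count implied by a 160ms streaming unit at 25 fps. The summary does not say how Wan-Streamer groups tokens into blocks or how large the blocks are, so the function names, the toy sizes, and the "one block sees itself and all earlier blocks" rule are illustrative assumptions, not the paper's implementation.

```python
from typing import List


def frames_per_unit(unit_ms: int, fps: int) -> int:
    """Number of whole video frames covered by one streaming unit."""
    return unit_ms * fps // 1000


def block_causal_mask(num_tokens: int, block_size: int) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Tokens are grouped into consecutive blocks of `block_size`. A token sees
    every token in its own block and in all earlier blocks, never later ones.
    This is the generic block-causal pattern (hypothetical layout).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [k // block_size <= q // block_size for k in range(num_tokens)]
        for q in range(num_tokens)
    ]


if __name__ == "__main__":
    # 160 ms units at 25 fps are the figures quoted in the summary.
    print(frames_per_unit(160, 25))
    # Toy layout (an assumption): 6 tokens in blocks of 2.
    for row in block_causal_mask(6, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

With blocks of this kind, a model can emit one whole block (for example, one streaming unit) at a time while still attending freely inside that block, which is what lets generation proceed in short chunks rather than token by token or all at once.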

Why it matters

Real-time interactive AI, where a model sees, hears, and responds with audio and video within about half a second, has been a hard systems problem. Wan-Streamer demonstrates that end-to-end joint training in a single Transformer can hit latency targets that previously required specialized pipeline glue.

Importance: 2/5

Novel architecture achieving <550ms full-duplex multimodal latency; opens a path to sub-second AI interaction.

multimodal, streaming, real-time, audio, paper, architecture

Sources

official arXiv:2606.25041 — Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

media HuggingFace Daily Papers — June 25, 2026 (22 upvotes)