Astra: RL-Trained VLM Queries World Simulator for Spatial Reasoning

Research · official + media · 2 sources · ~1 min read

Astra combines an RL-trained VLM policy (Astra-VL) with a world simulator (Astra-WM) built on Bagel. During spatial reasoning, the policy issues natural-language camera instructions to the simulator to imagine novel viewpoints of the scene. Astra-WM boosts Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5 (+4.4); Astra-VL lifts Qwen3-VL from 29.8 to 38.8 on MMSI-Bench (+9.0) and from 36.8 to 42.7 on MindCube (+5.9).
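The query loop described above can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the paper's implementation: the names `Step`, `reason_with_imagination`, `toy_policy`, and `toy_simulator` are hypothetical, views are plain strings standing in for images, and the `max_queries` budget is an assumption rather than something the sources specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

View = str  # stand-in for an image; a real system would pass pixel data


@dataclass
class Step:
    """One policy decision: a camera instruction OR a final answer (hypothetical type)."""
    instruction: Optional[str] = None
    answer: Optional[str] = None


def reason_with_imagination(
    question: str,
    views: List[View],
    policy: Callable[[str, List[View]], Step],
    simulator: Callable[[List[View], str], View],
    max_queries: int = 3,  # assumed budget, not taken from the paper
) -> Tuple[str, List[str]]:
    """Alternate between policy decisions and simulator calls until an answer appears."""
    views = list(views)      # copy so the caller's list is not mutated
    trace: List[str] = []    # camera instructions issued so far
    for _ in range(max_queries):
        step = policy(question, views)
        if step.answer is not None:
            return step.answer, trace
        # The policy asked for a new viewpoint in natural language.
        trace.append(step.instruction)
        views.append(simulator(views, step.instruction))
    # Budget exhausted: ask once more and fall back if no answer is given.
    final = policy(question, views)
    return (final.answer or "unknown"), trace


def toy_policy(question: str, views: List[View]) -> Step:
    """Toy stand-in for the VLM policy: asks for views until it has three."""
    if len(views) < 3:
        return Step(instruction=f"rotate camera 90 degrees left (query {len(views)})")
    return Step(answer="left of the sofa")


def toy_simulator(views: List[View], instruction: str) -> View:
    """Toy stand-in for the world simulator: returns a labelled placeholder view."""
    return f"imagined[{instruction}]"


answer, trace = reason_with_imagination(
    "Where is the lamp?", ["view_0"], toy_policy, toy_simulator
)
```

In this sketch the policy decides when to stop querying, which mirrors the idea of tool use learned through RL; in a real system both callables would wrap model inference rather than string stubs.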

Why it matters

Spatial reasoning from limited viewpoints is a longstanding VLM weakness. Astra shows that actively imagining new views via RL-trained tool use is tractable and yields measurable gains on the MMSI-Bench and MindCube benchmarks.

Importance: 2/5

Novel architecture for VLM spatial reasoning with measurable benchmark gains

vlm reasoning world-models multimodal rl vision-language paper

Sources

official Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators — arXiv

media HuggingFace Paper Page — arXiv:2606.06476