Astra: RL-Trained VLM Queries World Simulator for Spatial Reasoning
Astra pairs an RL-trained VLM policy (Astra-VL) with a world simulator (Astra-WM) built on Bagel. During spatial reasoning, the policy issues natural-language camera instructions to the simulator to imagine novel viewpoints. Astra-WM raises Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5; Astra-VL raises Qwen3-VL from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube.
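The query loop described above can be sketched roughly as follows. This is a minimal illustration and not Astra's actual code or API: `View`, `stub_policy`, `stub_simulator`, `reason_with_imagination`, and the `max_queries` budget are all hypothetical names chosen for the example, and the stubs stand in for the real VLM policy and world simulator.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class View:
    """Placeholder for an image; here it is only a text description."""
    description: str


# Hypothetical interfaces (assumptions, not taken from the Astra release).
Policy = Callable[[str, List[View]], Tuple[str, str]]
Simulator = Callable[[List[View], str], View]


def stub_simulator(views: List[View], instruction: str) -> View:
    # Stand-in for a world simulator: "imagines" a view by describing it.
    return View(f"imagined view after: {instruction}")


def stub_policy(question: str, views: List[View]) -> Tuple[str, str]:
    # Stand-in for the VLM policy: request one new view, then answer.
    if len(views) < 2:
        return ("query", "rotate the camera 90 degrees to the left")
    return ("answer", f"answer based on {len(views)} views")


def reason_with_imagination(
    question: str,
    views: List[View],
    policy: Policy,
    simulator: Simulator,
    max_queries: int = 3,
) -> Tuple[Optional[str], List[View]]:
    """Let the policy alternate between querying the simulator and answering."""
    views = list(views)  # copy so the caller's list is not mutated
    for _ in range(max_queries):
        action, payload = policy(question, views)
        if action == "answer":
            return payload, views
        # Otherwise the payload is a natural-language camera instruction.
        views.append(simulator(views, payload))
    # Query budget exhausted: give the policy one last chance to answer.
    action, payload = policy(question, views)
    return (payload if action == "answer" else None), views


answer, seen = reason_with_imagination(
    "Which object is left of the sofa?",
    [View("front view of the room")],
    stub_policy,
    stub_simulator,
)
```

In this sketch the simulator is treated as a tool the policy may call a bounded number of times before it commits to an answer; how Astra actually budgets or formats these calls is not stated in the summary.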
Why it matters
Spatial reasoning from limited viewpoints is a longstanding VLM weakness. Astra demonstrates that actively imagining new views via RL-trained tool use is tractable and yields measurable gains on established 3D reasoning benchmarks.
Importance: 2/5
Novel architecture for VLM spatial reasoning with measurable benchmark gains