VRRL: Visually Grounded Self-Reflection for Vision-Language Models via RL

UT Austin / Cornell

VRRL introduces two RL-based mechanisms to help VLMs correct their own errors using actual visual evidence rather than language priors. Trajectory masking trains models to recover from mid-sequence mistakes; buffered roll-in exposes models to diverse failure states. Tested on out-of-distribution visual grounding benchmarks (tables, charts, spatial navigation), VRRL substantially outperforms standard RL and reflection-focused fine-tuning baselines.
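The two mechanisms can be pictured with a small sketch. It is an illustration under stated assumptions, not the paper's implementation: the summary does not say how either mechanism is built. Trajectory masking is read here as giving zero loss weight to the rolled-in, possibly erroneous, prefix so that only the recovery tokens are trained. Buffered roll-in is read as a store of failed prefixes from which later rollouts start. The names `masked_pg_loss` and `RollInBuffer` are hypothetical.

```python
import random
from collections import deque


def masked_pg_loss(logprobs, advantages, roll_in_len):
    """REINFORCE-style loss averaged over the tokens after the roll-in prefix.

    Assumption: the first `roll_in_len` tokens (the erroneous prefix) get
    zero weight, so gradient would flow only through the recovery tokens.
    """
    mask = [0.0 if t < roll_in_len else 1.0 for t in range(len(logprobs))]
    n_active = sum(mask)
    if n_active == 0:
        return 0.0
    total = sum(-m * lp * adv for m, lp, adv in zip(mask, logprobs, advantages))
    return total / n_active


class RollInBuffer:
    """Hypothetical bounded buffer of failed prefixes used to start rollouts."""

    def __init__(self, capacity=1000, seed=0):
        self.prefixes = deque(maxlen=capacity)  # oldest entries are evicted
        self.rng = random.Random(seed)

    def add_failure(self, tokens, error_step):
        # Keep the trajectory up to and including the first wrong step.
        self.prefixes.append(list(tokens[: error_step + 1]))

    def sample(self):
        # Fall back to an empty prefix (a fresh rollout) when nothing is stored.
        return self.rng.choice(list(self.prefixes)) if self.prefixes else []


# Usage: store one failed trajectory, roll in from it, and score a rollout
# whose first three tokens are the stored prefix.
buffer = RollInBuffer()
buffer.add_failure(["read", "row 2", "wrong cell", "answer"], error_step=2)
prefix = buffer.sample()
logprobs = [-0.5, -0.4, -2.0, -0.3, -0.2]
advantages = [1.0] * 5
loss = masked_pg_loss(logprobs, advantages, roll_in_len=len(prefix))
```

In this reading the model is never rewarded or penalized for the mistake it was handed, only for what it does after it, which is one plausible way to train recovery rather than imitation of the error.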

Why it matters

VLMs often fall back on language statistics when self-correcting rather than looking at the image. VRRL directly targets this gap; gains on tables and charts are relevant for document understanding.

Importance: 2/5

An RL approach that targets visual grounding during VLM self-correction, with practical gains on out-of-distribution document tasks.

vlm rl reasoning multimodal reinforcement-learning

Sources

official Visually Grounded Self-Reflection for VLMs via RL (arXiv)