VRRL: Visually Grounded Self-Reflection for Vision-Language Models via RL
UT Austin / Cornell
VRRL introduces two RL-based mechanisms to help VLMs correct their own errors using actual visual evidence rather than language priors. Trajectory masking trains models to recover from mid-sequence mistakes; buffered roll-in exposes models to diverse failure states. Tested on out-of-distribution visual grounding benchmarks (tables, charts, spatial navigation), VRRL substantially outperforms standard RL and reflection-focused fine-tuning baselines.
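The summary does not spell out either mechanism, so the following is only a rough sketch of one plausible reading, not the authors' implementation. It assumes trajectory masking means excluding the flawed prefix of a rollout from the loss so that only the recovery tokens are trained on, and that buffered roll-in means keeping a bounded store of failed partial trajectories to restart from. All names here (`recovery_loss_mask`, `masked_mean`, `FailureBuffer`) are hypothetical.

```python
import random
from collections import deque


def recovery_loss_mask(num_tokens, mistake_end):
    """Return 0/1 weights: 0 for the flawed prefix, 1 for the recovery tokens.

    Assumption: the mistake occupies positions [0, mistake_end).
    """
    if not 0 <= mistake_end <= num_tokens:
        raise ValueError("mistake_end must lie within the trajectory")
    return [0] * mistake_end + [1] * (num_tokens - mistake_end)


def masked_mean(per_token_losses, mask):
    """Average per-token losses over the unmasked positions only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_losses, mask)) / kept


class FailureBuffer:
    """Bounded store of failed partial trajectories to roll in from (hypothetical)."""

    def __init__(self, capacity, seed=None):
        self._items = deque(maxlen=capacity)  # oldest failures are evicted first
        self._rng = random.Random(seed)

    def __len__(self):
        return len(self._items)

    def add(self, prefix_tokens):
        self._items.append(list(prefix_tokens))

    def sample(self):
        """Pick a stored failure prefix uniformly; empty list if none are stored."""
        if not self._items:
            return []
        return list(self._rng.choice(list(self._items)))
```

In this reading, a training step would draw a prefix from the buffer, let the policy continue from it, and weight the loss with the mask so that the gradient rewards the correction rather than the mistake that preceded it.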
Why it matters
When self-correcting, VLMs often fall back on language statistics rather than looking at the image. VRRL targets this gap directly, and its gains on tables and charts are relevant to document understanding.
Importance: 2/5
An RL approach that targets VLM visual grounding during self-correction, with practical gains on out-of-distribution document tasks.