VideoKR: 315K-Example Training Corpus for Knowledge- and Reasoning-Intensive Video Understanding

Yale University


VideoKR is a 315K-example training corpus for knowledge- and reasoning-intensive video understanding, built from 145K CC-licensed expert-domain videos, with chain-of-thought rationales at progressively greater reasoning depths. It also includes VideoKR-Eval, an expert-annotated benchmark that requires genuine video-grounded reasoning rather than textual shortcuts. Supervised fine-tuning (SFT) followed by GRPO post-training on VideoKR outperforms prior post-training approaches.
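For readers unfamiliar with GRPO (Group Relative Policy Optimization), the core idea is that several responses are sampled for the same prompt and each response's reward is normalized against the rest of its group. The sketch below illustrates only that normalization step. The function name, the binary correctness rewards, and the use of the population standard deviation are assumptions made for illustration; the summary above does not describe the reward design or training setup actually used with VideoKR.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat the group average get a positive advantage and the rest get a
    negative one. `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers: 1.0 if correct, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero, so that group contributes no learning signal under this normalization.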

Why it matters

Multimodal reasoning benchmarks have been criticized for being solvable from text alone. VideoKR targets this gap with video-grounded knowledge reasoning, providing both training data and evaluation infrastructure for progress on genuinely vision-dependent tasks.

Importance: 2/5

Official sources on arXiv and HuggingFace; a large-scale dataset and benchmark addressing a documented shortcut problem in multimodal evaluation.

multimodal video-generation reasoning benchmark paper

Sources

official VideoKR — arXiv:2606.05259

official VideoKR — HuggingFace Papers