VideoKR: 315K-Example Training Corpus for Knowledge- and Reasoning-Intensive Video Understanding
Yale University
VideoKR is a 315K-example training corpus for knowledge- and reasoning-intensive video understanding, built from 145K CC-licensed expert-domain videos and annotated with chain-of-thought rationales of progressively greater reasoning depth. It also includes VideoKR-Eval, an expert-annotated benchmark designed to require genuine video-grounded reasoning rather than textual shortcuts. The paper reports that supervised fine-tuning (SFT) followed by GRPO (Group Relative Policy Optimization) post-training on VideoKR outperforms prior post-training approaches.
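For readers unfamiliar with GRPO, the core idea is that each sampled response is scored relative to the other responses drawn for the same prompt, so no separate value model is needed. The snippet below is a minimal sketch of that group-relative advantage step only. It is not taken from the VideoKR paper: the function name, the `eps` constant, the use of population standard deviation, and the binary correctness rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and eps are illustrative choices, not the paper's settings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one video question:
# 1.0 if the answer matched the reference, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative, and a group in which every answer earns the same reward yields all-zero advantages, so it contributes no learning signal.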
Why it matters
Multimodal reasoning benchmarks have been criticized for being solvable from text alone. VideoKR targets this gap with video-grounded knowledge reasoning, providing both training data and evaluation infrastructure for progress on genuinely vision-dependent tasks.
Importance: 2/5
Official arXiv + HuggingFace; large-scale dataset and benchmark addressing a documented shortcut problem in multimodal evaluation.