CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence (178 HF upvotes)
Peking University / Shanghai Artificial Intelligence Laboratory
CiteVQA evaluates multimodal LLMs not just on answer correctness but also on whether they cite the correct source region within documents. It introduces Strict Attributed Accuracy (SAA), which counts a response as correct only when both the answer and its bounding-box citation are correct. The benchmark covers 1,897 questions across 711 PDFs in seven domains and two languages. Testing 20 MLLMs reveals widespread 'Attribution Hallucination': models frequently produce correct answers while citing the wrong passages. Even the strongest model (Gemini-3.1-Pro-Preview) achieves only 76.0% SAA, and the best open-source model reaches just 22.5%.
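To make the SAA idea concrete, here is a minimal sketch of how a metric of this kind could be computed. It is not the benchmark's official scorer. The summary above only says that the answer and its bounding-box citation must both be correct, so the IoU threshold of 0.5, the page-match requirement, and every name here (`Prediction`, `iou`, `strict_attributed_accuracy`, `attribution_hallucination_rate`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Axis-aligned box as (x0, y0, x1, y1); the coordinate convention is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class Prediction:
    """Hypothetical record for one question; not the benchmark's real schema."""
    answer_correct: bool  # result of whatever answer judge is used
    page: int             # page the model cited
    box: Box              # region the model cited
    gold_page: int        # annotated evidence page
    gold_box: Box         # annotated evidence region


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def citation_correct(p: Prediction, iou_threshold: float = 0.5) -> bool:
    """Assumed criterion: same page and sufficient overlap with the gold box."""
    return p.page == p.gold_page and iou(p.box, p.gold_box) >= iou_threshold


def strict_attributed_accuracy(preds: List[Prediction],
                               iou_threshold: float = 0.5) -> float:
    """Fraction of questions with a correct answer AND a correct citation."""
    if not preds:
        return 0.0
    hits = sum(1 for p in preds
               if p.answer_correct and citation_correct(p, iou_threshold))
    return hits / len(preds)


def attribution_hallucination_rate(preds: List[Prediction],
                                   iou_threshold: float = 0.5) -> float:
    """Among correctly answered questions, the share citing the wrong region."""
    correct = [p for p in preds if p.answer_correct]
    if not correct:
        return 0.0
    wrong_cite = sum(1 for p in correct
                     if not citation_correct(p, iou_threshold))
    return wrong_cite / len(correct)
```

Under these assumptions, a model that answers three of four questions correctly but cites the right region for only one of them has 75% answer accuracy and 25% SAA. That is the kind of gap an answer-only benchmark cannot show.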
Why it matters
The paper received 178 upvotes on Hugging Face. CiteVQA exposes a reliability gap invisible to answer-only benchmarks: high accuracy can coexist with completely wrong citations. In law, finance, and medicine, an answer grounded in the wrong passage is dangerous regardless of whether it happens to be correct.
Importance: 3/5
178 HF upvotes; first benchmark to expose 'Attribution Hallucination' across 20 MLLMs, showing that even SOTA models (76% SAA) routinely cite wrong passages while answering correctly