CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence (178 HF upvotes)

Peking University / Shanghai Artificial Intelligence Laboratory

Research | official | 2 sources | ~1 min read

CiteVQA evaluates multimodal LLMs (MLLMs) not only on answer correctness but also on whether they cite the correct source region within the document. It introduces Strict Attributed Accuracy (SAA), which counts a response as correct only when both the answer and its bounding-box citation are correct. The benchmark covers 1,897 questions across 711 PDFs in seven domains and two languages. Testing 20 MLLMs reveals widespread 'Attribution Hallucination': models frequently produce correct answers while citing the wrong passages. Even the strongest model (Gemini-3.1-Pro-Preview) achieves only 76.0% SAA, and the best open-source model reaches just 22.5%.
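To make the SAA idea concrete, here is a minimal Python sketch of a strict "answer and citation must both be right" score. It is an illustration under stated assumptions, not the paper's evaluation code: this summary does not say how CiteVQA decides that a cited bounding box is correct, so the sketch assumes a same-page check plus an IoU threshold of 0.5. The names Prediction, citation_correct, score, and attribution_hallucination_rate are hypothetical.

```python
from dataclasses import dataclass

# A box is (x0, y0, x1, y1) in page coordinates, with x0 < x1 and y0 < y1.
Box = tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Prediction:
    answer_correct: bool  # judged against the reference answer elsewhere
    page: int             # page the model cited
    box: Box              # region the model cited
    gold_page: int        # annotated evidence page
    gold_box: Box         # annotated evidence region


def citation_correct(p: Prediction, iou_threshold: float = 0.5) -> bool:
    # ASSUMPTION: same page and IoU >= threshold counts as a correct citation.
    # The benchmark's actual matching rule may differ.
    return p.page == p.gold_page and iou(p.box, p.gold_box) >= iou_threshold


def score(preds: list[Prediction], iou_threshold: float = 0.5) -> dict[str, float]:
    """Return answer-only accuracy, strict attributed accuracy, and their gap."""
    if not preds:
        raise ValueError("preds must not be empty")
    n = len(preds)
    answer_hits = sum(p.answer_correct for p in preds)
    strict_hits = sum(
        p.answer_correct and citation_correct(p, iou_threshold) for p in preds
    )
    return {
        "answer_accuracy": answer_hits / n,
        "saa": strict_hits / n,
        # Hypothetical gap measure: right answer, wrong citation.
        "attribution_hallucination_rate": (answer_hits - strict_hits) / n,
    }
```

Because a response scores under the strict metric only if it already scores under answer-only accuracy, the strict score can never exceed answer accuracy. The difference between the two is the share of responses that are right for an unverifiable reason.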

Why it matters

The paper received 178 upvotes on HuggingFace Daily Papers. CiteVQA exposes a reliability gap that answer-only benchmarks cannot see: a model can score high on accuracy while citing the wrong passages. In law, finance, and medicine, an answer grounded in the wrong passage cannot be verified against its source, so it is dangerous even when it happens to be correct.

Importance: 3/5

178 HF upvotes; first benchmark exposing 'Attribution Hallucination' across 20 MLLMs. Even the strongest model reaches only 76.0% SAA, and models frequently cite the wrong passage while answering correctly.

benchmark multimodal document-understanding hallucination interpretability

Sources

official CiteVQA — arXiv:2605.12882

official HuggingFace Daily Papers — 178 upvotes