QUBRIC: Co-Designing Queries and Rubrics Extends RLVR to Open-Ended Reasoning Domains
QUBRIC (arXiv 2606.03968) addresses a structural weakness in rubric-based RLVR: open-ended queries produce vague rubrics, but narrowing queries introduces fabricated references. The method refines queries and rubrics jointly in three steps: it uses teacher-derived key points to convert open-ended questions into scenario-specific ones, generates contrastive rubrics from observed policy gaps, and filters for informative training pairs. The reported results are a 5.5-point improvement on ArenaHard over SFT baselines and a 6.3-point average gain on legal, moral, and narrative reasoning.
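To make the idea of filtering for informative training pairs concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes a rubric can be modelled as weighted pass/fail checks, and that a (query, rubric) pair counts as "informative" when rubric scores vary across sampled policy responses. The names `rubric_score`, `is_informative`, and the `min_var` threshold are hypothetical.

```python
from statistics import pvariance
from typing import Callable, Sequence, Tuple

# Assumption: a rubric is a list of (weight, check) items, where each check
# is a pass/fail predicate on a response. The paper may use a different form.
Rubric = Sequence[Tuple[float, Callable[[str], bool]]]


def rubric_score(response: str, rubric: Rubric) -> float:
    """Weighted fraction of rubric items the response satisfies, in [0, 1]."""
    total = sum(weight for weight, _ in rubric)
    if total == 0:
        return 0.0
    passed = sum(weight for weight, check in rubric if check(response))
    return passed / total


def is_informative(responses: Sequence[str], rubric: Rubric,
                   min_var: float = 0.01) -> bool:
    """Hypothetical filter: keep a pair only if scores differ across responses.

    If every sampled response gets the same score, the rubric cannot
    separate better answers from worse ones, so it carries no signal.
    """
    scores = [rubric_score(r, rubric) for r in responses]
    return len(scores) > 1 and pvariance(scores) >= min_var


# Toy example: a scenario-specific rubric vs. a vague one.
specific = [(1.0, lambda r: "statute" in r), (1.0, lambda r: "precedent" in r)]
vague = [(1.0, lambda r: len(r) > 0)]
samples = ["cites statute", "cites statute and precedent", "no citation"]
```

With these toy inputs, `is_informative(samples, specific)` keeps the pair because the three responses score differently, while `is_informative(samples, vague)` rejects it because every response passes the single vague check.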
Why it matters
QUBRIC extends RL with verifiable rewards (RLVR), which has driven recent reasoning breakthroughs, to subjective, open-ended domains where ground-truth answers do not exist. That is a significant step toward general-purpose reasoning models.
Importance: 2/5
Verified arXiv paper; extends RLVR beyond math and code to open-ended subjective reasoning domains.