QUBRIC: Co-Designing Queries and Rubrics Extends RLVR to Open-Ended Reasoning Domains

QUBRIC (arXiv 2606.03968) addresses a structural weakness in rubric-based RLVR: open-ended queries yield vague rubrics, while narrowing the queries risks introducing fabricated references. The method refines queries and rubrics jointly in three steps: it uses teacher-derived key points to convert open-ended questions into scenario-specific ones, generates contrastive rubrics from observed policy gaps, and filters for informative training pairs. The reported results are a 5.5-point improvement on ArenaHard over SFT baselines and an average gain of 6.3 points on legal, moral, and narrative reasoning.
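
The filtering idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes that a query/rubric pair counts as "informative" when its rubric scores vary across several sampled policy responses, and it uses simple keyword checks as a stand-in for an LLM judge. All names here (Criterion, TrainingPair, rubric_score, is_informative, filter_pairs, min_spread) are hypothetical.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import Callable, List


@dataclass
class Criterion:
    description: str
    weight: float
    check: Callable[[str], bool]  # hypothetical stand-in for an LLM judge


@dataclass
class TrainingPair:
    query: str
    rubric: List[Criterion]


def rubric_score(rubric: List[Criterion], response: str) -> float:
    """Weighted fraction of rubric criteria that the response satisfies."""
    total = sum(c.weight for c in rubric)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total


def is_informative(pair: TrainingPair, responses: List[str],
                   min_spread: float = 0.1) -> bool:
    """Assumed criterion: scores must vary across the sampled responses.

    If every response gets the same score, the rubric cannot rank them,
    so the pair gives the policy no signal to learn from.
    """
    scores = [rubric_score(pair.rubric, r) for r in responses]
    return len(scores) > 1 and pstdev(scores) >= min_spread


def filter_pairs(pairs: List[TrainingPair],
                 sample_fn: Callable[[str], List[str]],
                 min_spread: float = 0.1) -> List[TrainingPair]:
    """Keep only pairs whose rubric separates the sampled responses."""
    return [p for p in pairs
            if is_informative(p, sample_fn(p.query), min_spread)]


# Toy data: one scenario-specific rubric and one vague rubric.
sharp = TrainingPair(
    query="Is the tenant liable under the lease in this scenario?",
    rubric=[
        Criterion("cites the statute", 2.0, lambda r: "statute" in r),
        Criterion("cites precedent", 1.0, lambda r: "precedent" in r),
    ],
)
vague = TrainingPair(
    query="Discuss tenant liability.",
    rubric=[Criterion("gives an answer", 1.0, lambda r: len(r) > 0)],
)


def fake_samples(query: str) -> List[str]:
    # Fixed strings standing in for sampled policy rollouts.
    return ["cites statute and precedent", "cites statute", "no law"]


kept = filter_pairs([sharp, vague], fake_samples)
```

In this toy run the vague rubric gives every sampled response the same score, so that pair is dropped, while the scenario-specific rubric separates the responses and is kept. This mirrors the motivation in the summary: a rubric that cannot distinguish responses provides little training signal.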

Why it matters

QUBRIC extends RL with verifiable rewards (RLVR), which has driven recent reasoning breakthroughs, to subjective, open-ended domains where ground-truth answers do not exist. This is a significant step toward general-purpose reasoning models.

Importance: 2/5

Verified arXiv paper; extends RLVR beyond math and code to open-ended, subjective reasoning domains.

rl reasoning training reward-modeling paper

Sources

official QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards — arXiv