RubricEM: Meta-RL with Rubric-Guided Policy Decomposition Beyond Verifiable Rewards
RubricEM proposes using rubrics as a shared interface that structures policy execution, judge feedback, and agent memory across the full research-agent lifecycle. The framework combines stagewise policy decomposition with a Stage-Structured GRPO objective that provides denser semantic rewards on long-horizon tasks. RubricEM-8B matches proprietary deep-research systems on four long-form research benchmarks.
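This summary does not give the exact form of the Stage-Structured GRPO objective, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes a rubric is a set of weighted criteria that a judge marks as satisfied (1) or not (0), and that each rollout receives one rubric score per stage. Advantages are then normalized within the rollout group separately for each stage, in the usual GRPO style of subtracting the group mean and dividing by the group standard deviation. The function names, the weighting scheme, and the per-stage normalization are illustrative assumptions, not the paper's definitions.

```python
from statistics import mean, pstdev


def rubric_score(judgments, weights):
    """Weighted fraction of rubric criteria satisfied, in [0, 1].

    judgments: criterion name -> 0 or 1 (hypothetical judge output).
    weights:   criterion name -> positive weight (hypothetical).
    """
    total = sum(weights.values())
    return sum(weights[c] * judgments[c] for c in weights) / total


def stagewise_advantages(stage_rewards, eps=1e-8):
    """Group-relative advantages computed independently per stage.

    stage_rewards[i][s] is the rubric score of rollout i at stage s.
    Assumption: every rollout in the group has the same number of stages.
    """
    n_stages = len(stage_rewards[0])
    advantages = [[0.0] * n_stages for _ in stage_rewards]
    for s in range(n_stages):
        column = [rewards[s] for rewards in stage_rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, r in enumerate(column):
            # GRPO-style normalization within the group for this stage
            advantages[i][s] = (r - mu) / (sigma + eps)
    return advantages


# Two rollouts, two stages; scores are made-up illustration values.
group = [[1.0, 0.0], [0.0, 0.0]]
adv = stagewise_advantages(group)
```

Because each stage is normalized on its own, a rollout that does well in one stage but poorly in another receives a separate learning signal for each, which is one plausible reading of "denser" rewards compared with a single end-of-trajectory score.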
Why it matters
RubricEM addresses a fundamental limitation of RLVR (reinforcement learning with verifiable rewards): most tasks lack verifiable ground-truth rewards. Using rubrics as structured reward signals extends RL fine-tuning to open-ended tasks such as evidence synthesis and report writing.
Importance: 3/5
Google research — meta-RL beyond verifiable rewards, extends RL to open-ended research tasks, HF Daily Papers 56 upvotes