RubricEM: Meta-RL with Rubric-Guided Policy Decomposition Beyond Verifiable Rewards

Google


RubricEM proposes using rubrics as a shared interface that structures policy execution, judge feedback, and agent memory across the full research-agent lifecycle. The framework combines stagewise policy decomposition with a Stage-Structured GRPO objective, which is intended to give denser semantic rewards on long-horizon tasks than a single end-of-episode score. According to the paper, RubricEM-8B matches proprietary deep-research systems on four long-form research benchmarks.
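To make the "denser rewards" idea concrete, here is a minimal sketch of group-relative advantages computed per stage instead of once per rollout. Standard GRPO normalizes each rollout's reward against the mean and standard deviation of its sampling group; applying that normalization separately at each stage is an assumption made here for illustration, not the paper's confirmed objective. The function name `stagewise_advantages`, the three-stage layout, and the scores are all hypothetical.

```python
from statistics import mean, pstdev


def stagewise_advantages(stage_rewards, eps=1e-8):
    """Group-normalize rewards independently for each stage.

    stage_rewards[i][s] is the rubric score of rollout i at stage s.
    Returns a same-shaped list of advantages. (Hypothetical helper:
    per-stage normalization is an assumption, not the paper's formula.)
    """
    n_stages = len(stage_rewards[0])
    advantages = [[0.0] * n_stages for _ in stage_rewards]
    for s in range(n_stages):
        column = [rollout[s] for rollout in stage_rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, reward in enumerate(column):
            # eps keeps the division finite when all rollouts tie
            advantages[i][s] = (reward - mu) / (sigma + eps)
    return advantages


# Three rollouts of the same prompt, each scored at three made-up stages
# (for example: plan, gather evidence, write report).
group = [
    [0.9, 0.4, 0.7],
    [0.5, 0.4, 0.2],
    [0.1, 0.4, 0.6],
]
adv = stagewise_advantages(group)
```

In this sketch a rollout that planned well but wrote a weak report gets a positive signal for the first stage and a negative one for the last, where a single episode-level reward would blur the two together. A stage on which every rollout ties contributes zero advantage.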

Why it matters

Addresses a fundamental limitation of reinforcement learning with verifiable rewards (RLVR): most tasks do not have verifiable ground-truth rewards. By using rubrics as structured reward signals, RubricEM extends RL fine-tuning to open-ended tasks like evidence synthesis and report writing.
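One way to picture a rubric acting as a reward signal is a weighted checklist whose per-criterion judge scores are collapsed into a single scalar. The sketch below assumes that setup; the `Criterion` class, the `rubric_reward` function, the criterion names, and the weights are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float


def rubric_reward(criteria, judge_scores):
    """Weighted average of per-criterion judge scores, in [0, 1].

    judge_scores maps criterion name to a score; missing criteria count
    as 0 and scores are clamped to [0, 1]. (Hypothetical aggregation.)
    """
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs positive total weight")
    weighted = sum(
        c.weight * min(1.0, max(0.0, judge_scores.get(c.name, 0.0)))
        for c in criteria
    )
    return weighted / total_weight


# Made-up rubric for a research report.
rubric = [
    Criterion("cites_primary_sources", 2.0),
    Criterion("covers_counterevidence", 1.0),
    Criterion("clear_structure", 1.0),
]
# The judge did not score "clear_structure", so it contributes 0.
reward = rubric_reward(
    rubric,
    {"cites_primary_sources": 1.0, "covers_counterevidence": 0.5},
)
```

Unlike an exact-match check against a ground-truth answer, a score like this can be computed for a free-form report, which is what lets RL fine-tuning reach tasks without verifiable answers.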

Importance: 3/5

Google research — meta-RL beyond verifiable rewards, extends RL to open-ended research tasks, HF Daily Papers 56 upvotes

rl reasoning agents paper benchmark

Sources

official RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

secondary HuggingFace Daily Papers — RubricEM (56 upvotes)