Z-Reward: Score Distributions Instead of Scalar Rewards for Image Generation RLHF

Alibaba


Z-Reward replaces single scalar reward values with distributions over rubric scores for RLHF in text-to-image generation. A 27B teacher model reasons explicitly to produce score distributions; a student model is trained via Reasoning-Internalized Score Distillation (RISD) to internalize this reasoning, so it needs no chain-of-thought at inference time. Group-wise Direct Score Optimization (GDSO) combines policy-gradient rewards with direct distribution supervision. The 27B teacher achieves 89.6% human preference accuracy, and the 9B student comes close at 88.6%. Used as a differentiable reward signal during generation, Z-Reward achieves a 41.3% net human-preference improvement.
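To make the contrast between a scalar reward and a score distribution concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the 5-point rubric scale, the function names, and the choice of KL divergence as the distillation loss are all hypothetical and are not taken from the source.

```python
import math

# Hypothetical 5-point rubric scale; the source does not specify the scale.
SCORES = [1, 2, 3, 4, 5]


def softmax(logits):
    """Turn raw per-score logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(probs, scores=SCORES):
    """Collapse a score distribution into a single scalar (its mean)."""
    return sum(p * s for p, s in zip(probs, scores))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), used here as an assumed stand-in for a distillation loss
    that pulls a student distribution q toward a teacher distribution p."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Two teacher judgments with the same mean but very different shapes.
confident = [0.0, 0.0, 1.0, 0.0, 0.0]  # all mass on score 3
split = [0.5, 0.0, 0.0, 0.0, 0.5]      # torn between score 1 and score 5

# An untrained student that assigns equal probability to every score.
student = softmax([0.0, 0.0, 0.0, 0.0, 0.0])

print(expected_score(confident), expected_score(split))
print(kl_divergence(confident, student))
```

Both teacher judgments collapse to the same expected score, so a scalar reward cannot tell them apart, while a loss defined over the full distribution can. That is the kind of signal the distribution framing is meant to preserve.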

Why it matters

The paper received 34 upvotes on Hugging Face on June 11. The distribution-over-rubrics framing generalizes beyond image generation to any RLHF domain where scalar rewards lose signal. At the teacher scale, the 89.6% human preference accuracy surpasses all reported baselines.

Importance: 3/5

Notable research from Alibaba; 89.6% human preference accuracy is SOTA; distribution-based reward modeling with broad applicability to other RLHF domains.

rl reward-modeling multimodal reasoning rlhf

Sources

official arXiv:2606.09076 — Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions