Z-Reward: Score Distributions Instead of Scalar Rewards for Image Generation RLHF
Alibaba
Z-Reward replaces single scalar reward values with distributions over rubric scores for RLHF in text-to-image generation. A 27B teacher model reasons explicitly to produce score distributions; a student model is trained via Reasoning-Internalized Score Distillation (RISD) to internalize this reasoning, so it needs no chain-of-thought at inference time. Group-wise Direct Score Optimization (GDSO) combines policy-gradient rewards with direct distribution supervision. The 27B teacher achieves 89.6% human preference accuracy, and the 9B student comes within one point at 88.6%. Used as a differentiable reward signal during generation, Z-Reward achieves a 41.3% net human-preference improvement.
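To make the distribution-versus-scalar idea concrete, here is a minimal sketch rather than the paper's implementation. It assumes a hypothetical 5-point rubric, made-up logits, the expected score as the scalar summary, and KL divergence as the teacher-to-student distillation loss; none of these specifics come from the summary above, and all function names are illustrative.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution over score levels."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score(probs, levels):
    """Collapse a score distribution to a single scalar (its mean)."""
    return sum(p * s for p, s in zip(probs, levels))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); used here as an ASSUMED distillation loss, not the paper's."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical rubric with score levels 1..5 (an assumption for illustration).
LEVELS = [1, 2, 3, 4, 5]

# Two distributions with the same mean but very different shapes:
# a scalar reward cannot tell them apart, a distribution can.
confident = [0.0, 0.0, 1.0, 0.0, 0.0]
polarized = [0.5, 0.0, 0.0, 0.0, 0.5]
mean_confident = expected_score(confident, LEVELS)
mean_polarized = expected_score(polarized, LEVELS)
shape_gap = kl_divergence(confident, polarized)

# Made-up teacher and student logits for a single image and rubric item.
teacher_probs = softmax([-2.0, -1.0, 0.5, 2.0, 1.0])
student_probs = softmax([-1.5, -0.5, 0.8, 1.5, 0.7])

# Distillation signal: how far the student's distribution is from the teacher's.
distill_loss = kl_divergence(teacher_probs, student_probs)

# Scalar summary of the student's distribution, usable as a reward value.
reward = expected_score(student_probs, LEVELS)

print(f"means: {mean_confident} vs {mean_polarized}, shape gap: {shape_gap:.3f}")
print(f"student reward: {reward:.3f}, distillation loss: {distill_loss:.4f}")
```

The `confident` and `polarized` cases have identical means, which is the sense in which a scalar reward loses signal: only the full distribution separates agreement on a middling score from a split between very low and very high scores.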
Why it matters
34 upvotes on Hugging Face on June 11. The distribution-over-rubrics framing generalizes beyond image generation to any RLHF domain where a single scalar reward loses signal. At the teacher scale, the 89.6% human preference accuracy surpasses all reported baselines.
Importance: 3/5
Notable research from Alibaba: the teacher's 89.6% human preference accuracy is state of the art among reported baselines, and distribution-based reward modeling is broadly applicable to other RLHF domains.