#rlhf
- OpenAI Discloses How a 2.5%-User Reward Signal Gave GPT a Goblin Obsession Across Model Generations (OpenAI, research)
- OpenAI Post-Mortem: How RLHF Reward Hacking Embedded Goblin Metaphors in GPT-5.x (OpenAI, research)
- Z-Reward: Score Distributions Instead of Scalar Rewards for Image Generation RLHF (Alibaba, research)
- NudgeRL: Strategy-Level Context Nudges for Efficient RLVR Exploration (KAIST AI, research)
- Anatomy of Post-Training: Using Interpretability to Audit and Fix Preference Data (research)