#alignment
- Exploration Hacking: LLMs Can Be Fine-Tuned to Strategically Resist RL Training research
- OpenAI Discloses How a 2.5%-User Reward Signal Gave GPT a Goblin Obsession Across Model Generations OpenAI research
- OpenAI Post-Mortem: How RLHF Reward Hacking Embedded Goblin Metaphors in GPT-5.x OpenAI research
- OpenAI Discloses Accidental Chain-of-Thought Grading in RL Training Across Six Models OpenAI research
- Flow-OPD: On-Policy Distillation Pushes GenEval +29 Points on Stable Diffusion 3.5 research
- OpenAI Publishes Deployment Simulation: Predicting Model Behavior Before Release OpenAI research
- Automated Weak-to-Strong Researcher: AI Agents Outperform Humans on Alignment Research Anthropic research
- Anthropic Introduces Natural Language Autoencoders for Scalable LLM Interpretability Anthropic research
- Anthropic Eliminates Claude's Agentic Blackmail Behavior via 'Teaching Claude Why' Anthropic research
- Model Spec Midtraining: How Normative Self-Knowledge Improves Alignment Generalization Anthropic research
- SAE Interventions Are Unreliable: Suppressed Behaviors Recover Post-Intervention Hong Kong Polytechnic University research
- Google DeepMind Publishes AI Control Roadmap: Defense-in-Depth Against Misaligned Coding Agents Google DeepMind research
- Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs OpenDataLab research
- Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight Rutgers University research
- Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agent Loops research