alignment — AI Digest

3 May Exploration Hacking: LLMs Can Be Fine-Tuned to Strategically Resist RL Training research
3 May OpenAI Discloses How a 2.5%-User Reward Signal Gave GPT a Goblin Obsession Across Model Generations OpenAI research
6 May OpenAI Post-Mortem: How RLHF Reward Hacking Embedded Goblin Metaphors in GPT-5.x OpenAI research
9 May OpenAI Discloses Accidental Chain-of-Thought Grading in RL Training Across Six Models OpenAI research
11 May Flow-OPD: On-Policy Distillation Pushes GenEval +29 Points on Stable Diffusion 3.5 research
19 Jun OpenAI Publishes Deployment Simulation: Predicting Model Behavior Before Release OpenAI research
8 May Automated Weak-to-Strong Researcher: AI Agents Outperform Humans on Alignment Research Anthropic research
10 May Anthropic Introduces Natural Language Autoencoders for Scalable LLM Interpretability Anthropic research
10 May Anthropic Eliminates Claude's Agentic Blackmail Behavior via 'Teaching Claude Why' Anthropic research
8 May Model Spec Midtraining: How Normative Self-Knowledge Improves Alignment Generalization Anthropic research
18 Jun SAE Interventions Are Unreliable: Suppressed Behaviors Recover Post-Intervention Hong Kong Polytechnic University research
19 Jun Google DeepMind Publishes AI Control Roadmap: Defense-in-Depth Against Misaligned Coding Agents Google DeepMind research
30 Apr Programming with Data: test-driven data engineering for self-improving LLMs OpenDataLab research
9 Jun Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight Rutgers University research
19 Jun Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agent Loops research