safety — AI Digest

10 Jun Claude Fable 5 and Claude Mythos 5: Anthropic's Most Capable Model Goes Public Anthropic models-llm
14 Jun US Government Orders Anthropic to Disable Claude Fable 5 and Mythos 5 Globally Anthropic industry
3 May Exploration Hacking: LLMs Can Be Fine-Tuned to Strategically Resist RL Training research
6 May OpenAI Post-Mortem: How RLHF Reward Hacking Embedded Goblin Metaphors in GPT-5.x OpenAI research
9 May OpenAI Discloses Accidental Chain-of-Thought Grading in RL Training Across Six Models OpenAI research
13 May OpenAI Launches Daybreak: AI-Powered Vulnerability Detection Platform OpenAI tools
6 Jun US Congress Releases 269-Page 'Great American AI Act' Draft with 3-Year State Law Preemption industry
15 Jun Anthropic Staff to Meet White House Officials This Week to Negotiate Fable 5 Access Suspension Anthropic industry
19 Jun OpenAI Publishes Deployment Simulation: Predicting Model Behavior Before Release OpenAI research
8 May Natural Language Autoencoders: Turning Claude's Thoughts into Text Anthropic research
10 May Anthropic Introduces Natural Language Autoencoders for Scalable LLM Interpretability Anthropic research
10 May Anthropic Eliminates Claude's Agentic Blackmail Behavior via 'Teaching Claude Why' Anthropic research
8 May Model Spec Midtraining: How Normative Self-Knowledge Improves Alignment Generalization Anthropic research
18 Jun SAE Interventions Are Unreliable: Suppressed Behaviors Recover Post-Intervention Hong Kong Polytechnic University research
19 Jun Google DeepMind Publishes AI Control Roadmap: Defense-in-Depth Against Misaligned Coding Agents Google DeepMind research
5 May Meta Publishes Preparedness Report for Code World Model Before Open-Weight Release Meta research
20 May Google SynthID Reaches 100B+ Watermarked Assets; OpenAI and ElevenLabs Join C2PA Coalition Google DeepMind tools
4 May Cursor Launches Security Review Beta: PR Vulnerability Scanner and Scheduled CVE Agents Cursor tools
3 Jun Quantifying Faithful Confidence Expression in Large Reasoning Models Yale NLP research
11 Jun Anatomy of Post-Training: Using Interpretability to Audit and Fix Preference Data research
12 Jun Google DeepMind and Partners Launch $10M Multi-Agent AI Safety Research Fund Google DeepMind industry
14 Jun Anthropic Publishes First Public Record: 52,000-Person Survey on US AI Attitudes Anthropic research
19 Jun Claude Code v2.1.183: Auto Mode Safety Guards for Destructive Git and Infrastructure Commands Anthropic tools
28 Apr LLM Safety From Within (SIREN) University of Toronto CSSLab / McGill / LMU Munich research