#safety
- Claude Fable 5 and Claude Mythos 5: Anthropic's Most Capable Model Goes Public Anthropic models-llm
- US Government Orders Anthropic to Disable Claude Fable 5 and Mythos 5 Globally Anthropic industry
- Exploration Hacking: LLMs Can Be Fine-Tuned to Strategically Resist RL Training research
- OpenAI Post-Mortem: How RLHF Reward Hacking Embedded Goblin Metaphors in GPT-5.x OpenAI research
- OpenAI Discloses Accidental Chain-of-Thought Grading in RL Training Across Six Models OpenAI research
- OpenAI Launches Daybreak: AI-Powered Vulnerability Detection Platform OpenAI tools
- US Congress Releases 269-Page 'Great American AI Act' Draft with 3-Year State Law Preemption industry
- Anthropic Staff to Meet White House Officials This Week to Negotiate Fable 5 Access Suspension Anthropic industry
- OpenAI Publishes Deployment Simulation: Predicting Model Behavior Before Release OpenAI research
- Natural Language Autoencoders: Turning Claude's Thoughts into Text Anthropic research
- Anthropic Introduces Natural Language Autoencoders for Scalable LLM Interpretability Anthropic research
- Anthropic Eliminates Claude's Agentic Blackmail Behavior via 'Teaching Claude Why' Anthropic research
- Model Spec Midtraining: How Normative Self-Knowledge Improves Alignment Generalization Anthropic research
- SAE Interventions Are Unreliable: Suppressed Behaviors Recover Post-Intervention Hong Kong Polytechnic University research
- Google DeepMind Publishes AI Control Roadmap: Defense-in-Depth Against Misaligned Coding Agents Google DeepMind research
- Meta Publishes Preparedness Report for Code World Model Before Open-Weight Release Meta research
- Google SynthID Reaches 100B+ Watermarked Assets; OpenAI and ElevenLabs Join C2PA Coalition Google DeepMind tools
- Cursor Launches Security Review Beta: PR Vulnerability Scanner and Scheduled CVE Agents Cursor tools
- Quantifying Faithful Confidence Expression in Large Reasoning Models Yale NLP research
- Anatomy of Post-Training: Using Interpretability to Audit and Fix Preference Data research
- Google DeepMind and Partners Launch $10M Multi-Agent AI Safety Research Fund Google DeepMind industry
- Anthropic Publishes First Public Record: 52,000-Person Survey on US AI Attitudes Anthropic research
- Claude Code v2.1.183: Auto Mode Safety Guards for Destructive Git and Infrastructure Commands Anthropic tools
- LLM Safety From Within (SIREN) University of Toronto CSSLab / McGill / LMU Munich research