Anthropic Introduces Natural Language Autoencoders for Scalable LLM Interpretability
Anthropic
Anthropic introduces Natural Language Autoencoders (NLAs): two coupled LLM modules that learn to verbalize internal activations into human-readable text and reconstruct those activations from the text. Trained without explicit interpretability objectives, NLAs surface hidden model cognition — including 'unverbalized evaluation awareness' where Claude suspects it is being tested without stating so. Applied during Claude Opus 4.6's pre-deployment audit, the method identified malformed training data and safety-relevant hidden reasoning at 12–15× the rate of baseline approaches. Code and an interactive Neuronpedia demo were released alongside the paper.
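The core training loop is a plain autoencoding objective over activations with a text bottleneck. The sketch below is a minimal, illustrative toy (not Anthropic's code): the module sizes, the `Verbalizer`/`Reconstructor` names, and the Gumbel-softmax trick used to keep the discrete "text" bottleneck differentiable are all assumptions for illustration; the paper couples two full LLMs and trains them on real model activations.

```python
# Toy sketch of the NLA idea, assuming a simple MLP stand-in for each LLM half:
# a "verbalizer" compresses an activation into a short discrete token sequence,
# and a "reconstructor" must rebuild the activation from those tokens alone.
import torch
import torch.nn as nn
import torch.nn.functional as F

ACT_DIM, VOCAB, SEQ_LEN, EMB = 512, 1000, 16, 64  # illustrative sizes

class Verbalizer(nn.Module):
    """Stand-in for the LLM that 'writes' text describing an activation."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(ACT_DIM, SEQ_LEN * VOCAB)

    def forward(self, act):
        logits = self.proj(act).view(-1, SEQ_LEN, VOCAB)
        # Differentiable discrete bottleneck: hard one-hot "tokens" per position.
        return F.gumbel_softmax(logits, tau=1.0, hard=True)

class Reconstructor(nn.Module):
    """Stand-in for the LLM that reads the text and predicts the activation."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(VOCAB, EMB)            # token embeddings
        self.head = nn.Linear(SEQ_LEN * EMB, ACT_DIM)

    def forward(self, tokens):
        emb = self.embed(tokens)                      # (B, SEQ_LEN, EMB)
        return self.head(emb.flatten(1))

verbalizer, reconstructor = Verbalizer(), Reconstructor()
opt = torch.optim.Adam(
    [*verbalizer.parameters(), *reconstructor.parameters()], lr=1e-3
)

acts = torch.randn(32, ACT_DIM)          # activations captured from the subject model
for _ in range(100):
    tokens = verbalizer(acts)            # activation -> "text"
    recon = reconstructor(tokens)        # "text" -> activation
    loss = F.mse_loss(recon, acts)       # pure reconstruction loss,
    opt.zero_grad()                      # no interpretability labels
    loss.backward()
    opt.step()
```

The point the sketch makes is structural: because the only pressure is to reconstruct the activation through a human-readable channel, whatever the verbalizer writes must carry the information the activation encodes, which is why hidden content such as unverbalized evaluation awareness can surface without any interpretability supervision.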
Why it matters
NLAs offer a scalable, automated path to reading what a model 'thinks but doesn't say', which is directly relevant to detecting deceptive alignment, and the method has already been applied in a real-world safety audit of a production model.
Importance: 4/5
Novel interpretability method applied in the Claude Opus 4.6 pre-deployment audit; detects hidden cognition at 12–15× the baseline rate; from frontier lab Anthropic, with code and an interactive demo released.