#interpretability
- Natural Language Autoencoders: Turning Claude's Thoughts into Text Anthropic research
- Anthropic Introduces Natural Language Autoencoders for Scalable LLM Interpretability Anthropic research
- Judge Circuits: Mechanistic Explanation of LLM-as-Judge Format Inconsistency research
- CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence (178 HF upvotes) Peking University / Shanghai Artificial Intelligence Laboratory research
- SAE Interventions Are Unreliable: Suppressed Behaviors Recover Post-Intervention Hong Kong Polytechnic University research
- Quantifying Faithful Confidence Expression in Large Reasoning Models Yale NLP research
- Anatomy of Post-Training: Using Interpretability to Audit and Fix Preference Data research
- LLM Safety From Within (SIREN) University of Toronto CSSLab / McGill / LMU Munich research