interpretability — AI Digest

May 8 Natural Language Autoencoders: Turning Claude's Thoughts into Text Anthropic research

May 19 CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence (178 HF upvotes) Peking University / Shanghai Artificial Intelligence Laboratory research

Jun 18 SAE Interventions Are Unreliable: Suppressed Behaviors Recover Post-Intervention Hong Kong Polytechnic University research

Jun 3 Quantifying Faithful Confidence Expression in Large Reasoning Models Yale NLP research

Apr 28 LLM Safety From Within (SIREN) University of Toronto CSSLab / McGill / LMU Munich research