Anatomy of Post-Training: Using Interpretability to Audit and Fix Preference Data
Applies mechanistic interpretability to audit and improve post-training pipelines. The method identifies latent concepts in model representations that distinguish preferred from less preferred outputs, then uses those concepts to diagnose spurious correlations in preference datasets and shape rewards via feature or data interventions. Positions interpretability not just as a tool for understanding models after training, but as an active component in the training loop itself.
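To make the auditing step concrete, here is a minimal sketch of one way to surface features that separate preferred from less preferred outputs. It is not the paper's method: it assumes each response has already been mapped to a vector of latent feature activations by some extractor the summary does not specify, and it swaps in a plain paired t-statistic as the ranking score. The function names, the synthetic data, and the threshold are all hypothetical.

```python
import numpy as np

def rank_preference_features(chosen, rejected):
    """Rank features by how consistently they differ between chosen and rejected.

    chosen, rejected: arrays of shape (n_pairs, n_features), one row per
    preference pair. Returns (feature indices sorted by |score|, scores).
    """
    diff = chosen - rejected                       # per-pair activation gap
    mean = diff.mean(axis=0)
    std = diff.std(axis=0, ddof=1) + 1e-8          # avoid division by zero
    score = mean / (std / np.sqrt(len(diff)))      # paired t-statistic
    order = np.argsort(-np.abs(score))
    return order, score

def keep_pairs_not_driven_by(chosen, rejected, feature, threshold):
    """Data intervention (assumed form): keep pairs whose gap on `feature` is small."""
    gap = chosen[:, feature] - rejected[:, feature]
    return np.flatnonzero(np.abs(gap) <= threshold)

# Synthetic stand-in for real activations: feature 3 plays the role of a
# spurious cue (for example, response length) that tracks the preference label.
rng = np.random.default_rng(0)
n_pairs, n_features = 200, 8
chosen = rng.normal(size=(n_pairs, n_features))
rejected = rng.normal(size=(n_pairs, n_features))
chosen[:, 3] += 2.0

order, score = rank_preference_features(chosen, rejected)
top_feature = int(order[0])
kept = keep_pairs_not_driven_by(chosen, rejected, top_feature, threshold=1.0)
print(f"top feature: {top_feature}, pairs kept: {len(kept)} of {n_pairs}")
```

In a real audit, a human would inspect the top-ranked features to decide which reflect intended preferences and which are spurious before filtering data or adjusting the reward.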
Why it matters
Bridges the gap between interpretability research and practical alignment work. By diagnosing which concepts a reward model actually picks up on, including unintended ones, the approach offers a principled way to audit and correct the learning signal before it embeds bad behaviors.
Importance: 2/5
Solid alignment/interpretability paper; practical application of mech-interp to post-training data auditing.