SAE Interventions Are Unreliable: Suppressed Behaviors Recover Post-Intervention
Hong Kong Polytechnic University
This paper challenges a core assumption in mechanistic interpretability based on sparse autoencoders (SAEs): that clamping or suppressing SAE features reliably controls model behavior. The authors show that suppressed behaviors tend to recover post-intervention, undermining the reliability of SAE steering as a safety or control mechanism.
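For readers unfamiliar with the intervention being questioned, the following is a minimal sketch of what "clamping" an SAE feature usually means. It is not the authors' code. It assumes a toy ReLU autoencoder with randomly initialized weights, and every name in it (`sae_encode`, `sae_decode`, `clamp_feature`, `W_enc`, `W_dec`, `feature_idx`) is hypothetical and chosen for illustration only.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    # Sparse feature activations: ReLU of an affine map of the activation x.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f, W_dec, b_dec):
    # Reconstruct the model activation from feature activations.
    return f @ W_dec + b_dec

def clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx, value=0.0):
    # Encode, overwrite one feature with a fixed value, and decode.
    f = sae_encode(x, W_enc, b_enc)
    # Keep the SAE's reconstruction error so only the chosen feature changes.
    error = x - sae_decode(f, W_dec, b_dec)
    f_clamped = f.copy()
    f_clamped[..., feature_idx] = value
    return sae_decode(f_clamped, W_dec, b_dec) + error

# Toy setup (assumed sizes and random weights, not a trained SAE).
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))  # four hypothetical activation vectors
x_patched = clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx=3)
```

In practice `x_patched` would replace the original activation in the forward pass. The sketch covers only the intervention step; it does not model whether or how the suppressed behavior recovers afterwards, which is the question the paper addresses.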
Why it matters
Raises a critical concern for the interpretability community: if SAE feature suppression does not durably prevent behaviors, then steering-based alignment approaches built on SAEs may be less robust than assumed.
Importance: 3/5
Challenges a key technique in mechanistic interpretability and AI safety research, with direct implications for alignment work.