SAE Interventions Are Unreliable: Suppressed Behaviors Recover Post-Intervention

Hong Kong Polytechnic University


This paper challenges a core assumption in SAE-based mechanistic interpretability: that clamping or suppressing sparse autoencoder features reliably controls model behavior. The authors show that suppressed behaviors tend to recover post-intervention, undermining the reliability of SAE steering as a safety or control mechanism.
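For readers unfamiliar with the intervention being questioned, the following is a minimal sketch of what "clamping" an SAE feature typically looks like at a single layer. It is not the authors' code: the toy random weights, the ReLU encoder, the function names (`sae_encode`, `sae_decode`, `clamp_feature`) and the choice to add back the SAE reconstruction error are all assumptions made for illustration. It shows only the edit itself, not the recovery effect the paper reports.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    # Assumed ReLU encoder: feature activations are non-negative.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f, W_dec, b_dec):
    # Linear decoder mapping feature activations back to model space.
    return f @ W_dec + b_dec

def clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx, value=0.0):
    """Return activations with one SAE feature clamped to `value`.

    The SAE reconstruction error is added back so that only the
    chosen feature's contribution changes (an illustrative choice).
    """
    f = sae_encode(x, W_enc, b_enc)
    error = x - sae_decode(f, W_dec, b_dec)
    f_clamped = f.copy()
    f_clamped[..., feature_idx] = value
    return sae_decode(f_clamped, W_dec, b_dec) + error

# Toy, randomly initialised SAE (hypothetical sizes, not a trained model).
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))   # stand-in for residual-stream activations
feature_idx = 3                     # hypothetical feature to suppress
x_edit = clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx)
```

Under these assumptions the edit amounts to subtracting the feature's activation times its decoder direction from the activations at that layer. Nothing in the sketch constrains what later layers compute from the edited activations, which is where the paper's concern about recovery applies.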

Why it matters

The finding raises a critical concern for the interpretability community: if suppressing SAE features does not durably prevent a behavior, then steering-based alignment approaches built on SAEs may be less robust than assumed.

Importance: 3/5

Challenges a key technique in mechanistic interpretability and AI safety research with direct implications for alignment work.

interpretability safety sparse-autoencoders alignment paper

Sources

official SAE Interventions are Unreliable on HuggingFace Daily Papers