LLM Safety From Within (SIREN)

University of Toronto CSSLab / McGill / LMU Munich

SIREN applies linear probes to every internal layer of an LLM to identify "safety neurons" and combines their signals with adaptive weighting. It outperforms state-of-the-art open-source guard models on multiple benchmarks while using 250× fewer trainable parameters, and it supports streaming detection.
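
As a rough illustration of the general idea, and not the authors' implementation, the sketch below combines one linear probe per layer into a single harm score and checks that score token by token to mimic streaming detection. Everything named here is an assumption made for the example: the functions `probe_logit`, `harm_score`, and `stream_detect`, the use of a softmax over learnable per-layer scores as the "adaptive weighting", the 0.5 threshold, and the toy hidden states.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def probe_logit(hidden, weights, bias):
    # A linear probe: dot product of the hidden state with the probe weights.
    return sum(h * w for h, w in zip(hidden, weights)) + bias

def harm_score(layer_states, probes, layer_scores):
    # Assumed form of adaptive weighting: softmax over per-layer scores,
    # then a weighted sum of the per-layer probe logits.
    alphas = softmax(layer_scores)
    combined = sum(
        a * probe_logit(h, w, b)
        for a, h, (w, b) in zip(alphas, layer_states, probes)
    )
    return sigmoid(combined)

def stream_detect(token_layer_states, probes, layer_scores, threshold=0.5):
    # Return the index of the first token whose score reaches the threshold,
    # or None if no token does.
    for t, layer_states in enumerate(token_layer_states):
        if harm_score(layer_states, probes, layer_scores) >= threshold:
            return t
    return None

# Toy setup (hypothetical values): 2 layers, hidden size 2.
probes = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]
layer_scores = [0.0, 0.0]  # equal weight for both layers
safe = [[-2.0, 0.0], [0.0, -2.0]]
harmful = [[2.0, 0.0], [0.0, 2.0]]

print(stream_detect([safe, safe, harmful], probes, layer_scores))  # prints 2
```

In this sketch only the probe weights, biases, and per-layer scores would be trained while the LLM itself stays frozen, which is one plausible reading of how a probe-based detector keeps its trainable parameter count small.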