LLM Safety From Within (SIREN)

University of Toronto CSSLab / McGill / LMU Munich

Research | media only | 1 source | ~1 min

SIREN trains linear probes on every internal layer of an LLM to identify "safety neurons" and applies adaptive weighting to them. It beats state-of-the-art open-source guard models on multiple benchmarks while using 250× fewer trainable parameters, and it supports streaming detection.
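
To make the idea of per-layer linear probes with adaptive weighting concrete, here is a minimal sketch. It is not the authors' implementation: the activations are synthetic, the function names (`make_synthetic_activations`, `train_probe`, `layer_weights`, `combine`) are hypothetical, and the weighting rule (a softmax over each probe's validation accuracy) is an assumed stand-in for whatever adaptive weighting SIREN actually uses.

```python
import numpy as np


def make_synthetic_activations(n_layers=4, n_samples=400, dim=16, seed=0):
    """Fake per-layer hidden states; deeper layers separate the classes more."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_samples)  # 1 = harmful, 0 = benign
    layers = []
    for layer in range(n_layers):
        direction = rng.normal(size=dim)
        direction /= np.linalg.norm(direction)
        signal = 0.5 * layer  # layer 0 carries no class signal at all
        noise = rng.normal(size=(n_samples, dim))
        layers.append(noise + signal * np.outer(2 * labels - 1, direction))
    return np.stack(layers), labels  # shape: (n_layers, n_samples, dim)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_probe(x, y, lr=0.1, steps=300):
    """Logistic-regression probe for one layer, fit by plain gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        grad = sigmoid(x @ w + b) - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def layer_weights(val_acts, val_labels, probes, temperature=0.1):
    """Assumed weighting rule: softmax over each probe's validation accuracy."""
    accs = np.array([
        ((sigmoid(x @ w + b) > 0.5) == val_labels).mean()
        for x, (w, b) in zip(val_acts, probes)
    ])
    logits = accs / temperature
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()


def combine(acts, probes, weights):
    """Weighted average of the per-layer probe probabilities."""
    per_layer = np.stack([sigmoid(x @ w + b) for x, (w, b) in zip(acts, probes)])
    return weights @ per_layer


acts, labels = make_synthetic_activations()
train, val, test = slice(0, 200), slice(200, 300), slice(300, 400)
probes = [train_probe(x, labels[train]) for x in acts[:, train]]
weights = layer_weights(acts[:, val], labels[val], probes)
harm_scores = combine(acts[:, test], probes, weights)
accuracy = ((harm_scores > 0.5) == labels[test]).mean()
print("layer weights:", np.round(weights, 3))
print("held-out accuracy:", accuracy)
```

Each probe here has only `dim + 1` trainable parameters, which illustrates why a probe-based detector can be far smaller than a separate guard model. In a real setting the synthetic arrays would be replaced by hidden states read from the LLM, and for streaming detection the same scoring could in principle be applied to the activations of each newly generated token.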

Importance: 2/5

Backfilled from MD; not retroactively scored.

Sources

media arXiv