LLM Safety From Within (SIREN)
University of Toronto CSSLab / McGill / LMU Munich
Trains linear probes on all internal layers of an LLM to identify "safety neurons", which are combined through adaptive weighting. Outperforms state-of-the-art open-source guard models on multiple benchmarks with 250× fewer trainable parameters, and supports streaming detection.
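To make the summary concrete, here is a minimal sketch of per-layer linear probing with learned layer weights. It is an illustration under stated assumptions, not the authors' implementation: the softmax over per-layer scalars is one plausible reading of "adaptive weighting", the selection of individual "safety neurons" is omitted, and every name (`probe_scores`, `combined_harm_probability`, `layer_logits`) is hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def probe_scores(hidden_states, probe_w, probe_b):
    """Apply one linear probe per layer.

    hidden_states: (num_layers, d) activations for one input, one row per layer
    probe_w:       (num_layers, d) probe weights (assumed already trained)
    probe_b:       (num_layers,)   probe biases
    Returns a (num_layers,) array of per-layer logits.
    """
    return np.einsum("ld,ld->l", hidden_states, probe_w) + probe_b


def combined_harm_probability(hidden_states, probe_w, probe_b, layer_logits):
    """Combine the per-layer probe logits into a single probability.

    layer_logits: (num_layers,) learnable scalars; their softmax gives the
    layer weights. This weighting scheme is an assumption of the sketch.
    """
    weights = softmax(layer_logits)
    logits = probe_scores(hidden_states, probe_w, probe_b)
    return float(sigmoid(np.dot(weights, logits)))


if __name__ == "__main__":
    # Synthetic stand-in for real activations: 4 layers, hidden size 8.
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(4, 8))
    w = rng.normal(size=(4, 8))
    b = np.zeros(4)
    layer_logits = np.zeros(4)  # uniform layer weights before any training
    print(combined_harm_probability(hidden, w, b, layer_logits))
```

In this sketch the only trainable parameters are one weight vector and bias per layer plus one scalar per layer, which is the kind of budget that keeps a probe-based detector small next to a full guard model.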
Importance: 2/5
Backfilled from MD; not retroactively scored.
Sources
media
arXiv