SingGuard: Runtime Policy-Adaptive Multimodal LLM Guardrail with 56K-Example Benchmark

inclusionAI


SingGuard is a guardrail model for vision-language models that accepts natural-language safety policies at runtime instead of relying on rules fixed at training time. It checks content against the policy one rule at a time and offers three inference modes (fast, hybrid, slow) that trade interpretability for latency. A new benchmark, SingGuard-Bench, contains 56,340 examples across 80+ risk categories, including cross-modal joint-risk cases in which neither the text nor the image is harmful alone but their combination implies unsafe intent. When the policy changes at runtime, policy-following accuracy rises to ~74.1%, up from ~64.6% for prior methods.
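The rule-by-rule check and the three modes can be pictured with a small Python sketch. It is illustrative only and is not SingGuard's API: `Rule`, `RuleResult`, `toy_judge`, and `evaluate` are hypothetical names, the image is stood in for by a caption string, and a keyword matcher replaces the actual model. The meaning given to each mode here (fast returns verdicts only, hybrid explains only violated rules, slow explains every rule) is an assumption, since the summary says only that the modes trade interpretability for latency.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Rule:
    rule_id: str
    text: str                   # natural-language rule supplied at runtime
    keywords: Tuple[str, ...]   # toy stand-in only; a real guardrail reads `text`


@dataclass
class RuleResult:
    rule_id: str
    violated: bool
    rationale: Optional[str]    # None when the mode skips the explanation


def toy_judge(rule: Rule, user_text: str, image_caption: str) -> Tuple[bool, str]:
    """Hypothetical stand-in for the guardrail model.

    Flags a rule only when every keyword appears somewhere in the
    combined text + image description, so a rule can fire on the
    combination even if neither modality triggers it alone.
    """
    combined = f"{user_text} {image_caption}".lower()
    violated = all(k.lower() in combined for k in rule.keywords)
    why = f"{rule.rule_id}: " + ("all cues present" if violated else "cues incomplete")
    return violated, why


def evaluate(policy: List[Rule], user_text: str, image_caption: str,
             mode: str = "hybrid") -> List[RuleResult]:
    """Check content against each rule in turn.

    Assumed mode semantics (not taken from the paper):
      fast   -> verdicts only
      hybrid -> rationale only for violated rules
      slow   -> rationale for every rule
    """
    if mode not in ("fast", "hybrid", "slow"):
        raise ValueError(f"unknown mode: {mode}")
    results = []
    for rule in policy:
        violated, why = toy_judge(rule, user_text, image_caption)
        keep = mode == "slow" or (mode == "hybrid" and violated)
        results.append(RuleResult(rule.rule_id, violated, why if keep else None))
    return results


# The policy is plain data, so it can be swapped between calls without retraining.
policy = [Rule("R1", "Do not help with medication overdose.", ("dosage", "pills"))]
text = "What dosage would be too much?"
caption = "A photo of a bottle of pills."
joint = evaluate(policy, text, caption, mode="slow")
```

In this toy setup the text alone and the caption alone each leave `R1` unviolated, while the pair together trips it, which mirrors the joint-risk cases the benchmark targets.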

Why it matters

Most guardrail systems cannot adapt to a change in a product's safety policy without retraining. Runtime policy injection lets a single SingGuard deployment serve different regions or product lines, each with its own policy. The cross-modal joint-risk benchmark addresses a gap in existing safety evaluation suites.

Importance: 3/5

Hugging Face Daily Papers, June 29 (30 upvotes); runtime policy-adaptive guardrails address a real deployment gap; contributes a 56K-example benchmark.

Sources