Guard Models Lose Safety When Fine-Tuned on Benign Data
Safety classifiers like LlamaGuard and WildGuard collapse under standard domain specialization, even on benign datasets — not from adversarial attack, but from destruction of the latent geometry that separates harmful from safe outputs.
Granite Guardian's refusal rate dropped from 85% to 0% after fine-tuning. Researchers traced the failure to loss of per-layer safety subspaces (extracted via SVD), showing the harmful–benign representational boundary dissolves during routine adaptation.
Implication: guard models in agentic pipelines are brittle to the very specialization they're designed to undergo, raising questions about deployment robustness when these classifiers are tuned for domain-specific tasks.
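As an added illustration (not code from the summarized paper, whose exact extraction procedure this summary does not give), the probe below follows one common recipe for a per-layer safety subspace: SVD of the harmful-minus-benign activation differences, with a separation score measured inside the top singular directions. The activations are constructed synthetic data standing in for one layer's hidden states, and `simulate_benign_finetune` is an assumed model of the collapse, not the paper's fine-tuning.

```python
import numpy as np

def safety_basis(harm, benign, k=2):
    """Top-k right-singular directions of the harmful-minus-benign difference
    matrix. Assumed recipe standing in for the paper's per-layer SVD extraction."""
    _, _, vt = np.linalg.svd(harm - benign, full_matrices=False)
    return vt[:k]

def separation(harm, benign, basis):
    """Between-class distance over within-class spread inside the basis.
    Near 0 means the subspace no longer tells harmful from benign apart."""
    ph, pb = harm @ basis.T, benign @ basis.T
    between = np.linalg.norm(ph.mean(axis=0) - pb.mean(axis=0))
    within = 0.5 * (ph.std(axis=0).mean() + pb.std(axis=0).mean())
    return float(between / within)

def constructed_layer(rng, n=200, d=32, shift=6.0):
    """Constructed stand-in for one layer's activations: benign is isotropic
    noise, harmful is the same noise pushed `shift` units along one direction."""
    u = np.zeros(d)
    u[0] = 1.0  # the 'safety' direction (constructed: axis 0)
    benign = rng.standard_normal((n, d))
    harm = rng.standard_normal((n, d)) + shift * u
    return harm, benign, u

def simulate_benign_finetune(x, u, residual=0.02):
    """Assumed model of the collapse the summary describes: adaptation nearly
    erases the component along the safety direction, leaving the rest intact."""
    along = (x @ u)[:, None] * u
    return x - (1.0 - residual) * along

def run_probe(seed=0):
    """Compare separation before and after the simulated fine-tune, using both
    the original basis and a basis re-extracted from the adapted activations."""
    rng = np.random.default_rng(seed)
    harm, benign, u = constructed_layer(rng)
    basis_pre = safety_basis(harm, benign)
    harm_ft = simulate_benign_finetune(harm, u)
    benign_ft = simulate_benign_finetune(benign, u)
    basis_post = safety_basis(harm_ft, benign_ft)
    return {
        "separation_before": separation(harm, benign, basis_pre),
        "separation_after_old_basis": separation(harm_ft, benign_ft, basis_pre),
        "separation_after_new_basis": separation(harm_ft, benign_ft, basis_post),
        "subspace_overlap": float(abs(basis_pre[0] @ basis_post[0])),
    }

if __name__ == "__main__":
    for name, value in run_probe().items():
        print(f"{name:28s} {value:6.2f}")
```

A drop in the separation score together with a low overlap between the pre- and post-adaptation top directions is the signature to watch for when re-validating a guard model after domain tuning.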