Large language models write production code, yet they routinely introduce well-known vulnerabilities. We show this is not a knowledge deficit: the same models that generate insecure code correctly identify and explain the vulnerability when asked directly. We call this the Format-Reliability Gap.

Mechanistic analysis reveals the cause: security-relevant representations are encoded from the earliest layers but remain computationally inert until the final layer, where format-compliance demands compete with them. Because the failure is localized to a single layer, per-vulnerability steering vectors reduce insecure generation by up to 74% with negligible overhead — a targeted intervention rather than broad retraining.

arXiv:2604.16697 · HTML