LLMs that write insecure code can often correctly explain the very vulnerability they just introduced — a gap we call the "Format-Reliability Gap." Mechanistic analysis traces this to a single layer where format-compliance demands crowd out otherwise-present security representations. Because the failure is localized, per-vulnerability steering vectors reduce insecure generation by up to 74% with negligible overhead.
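To make the steering idea concrete, here is a minimal sketch of applying a steering vector at a single layer. It assumes a difference-of-means construction, which is one common way to build such vectors and is not confirmed here as the preprint's method. The names `steering_vector`, `apply_steering`, and `alpha`, along with the toy activations, are hypothetical and chosen only for illustration.

```python
import numpy as np

def steering_vector(secure_acts, insecure_acts):
    """Difference of mean activations (secure minus insecure) at one layer.

    Assumption: a difference-of-means vector; other constructions exist.
    """
    return secure_acts.mean(axis=0) - insecure_acts.mean(axis=0)

def apply_steering(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to the hidden state at every token position."""
    return hidden + alpha * vector

# Toy data standing in for real layer activations (hypothetical values).
rng = np.random.default_rng(0)
d_model = 8
direction = np.zeros(d_model)
direction[0] = 1.0  # pretend the "security" feature lives along axis 0

secure_acts = rng.normal(size=(32, d_model)) + 2.0 * direction
insecure_acts = rng.normal(size=(32, d_model)) - 2.0 * direction

v = steering_vector(secure_acts, insecure_acts)

# Hidden states for 5 token positions of an "insecure" generation.
hidden = rng.normal(size=(5, d_model)) - 2.0 * direction
steered = apply_steering(hidden, v, alpha=1.0)

# Mean projection onto the steering direction, before and after the shift.
unit = v / np.linalg.norm(v)
before = float((hidden @ unit).mean())
after = float((steered @ unit).mean())
print(f"mean projection before: {before:.2f}, after: {after:.2f}")
```

In a real model, the addition would typically happen inside the forward pass at the chosen layer, for example through a forward hook in PyTorch. The cost is one vector addition per token, which is in line with the negligible overhead described above.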
Recent Work
This study examines prompt injection and goal hijacking against GPT-3-era models and introduces Adversarial Fine-Tuning as a defense. Without it, attacks succeeded 31% of the time; with it, attack success dropped to near zero for smaller GPT-3 variants. We also find that more flexible models are more vulnerable: large models such as GPT-3 Davinci are more susceptible than GPT-2.
Large Language Models (LLMs) such as OpenAI Codex are increasingly being used as AI-based coding assistants. We conducted a security-driven user study (N=58) to assess code written by student programmers when assisted by LLMs. Our results indicate that in low-level C programming with pointer and array manipulations, the security impact is small: AI-assisted users produce critical security bugs at a rate no greater than 10% more than the control, indicating the use of LLMs does not introduce new security risks.
Recent News
- April 2026: New preprint on surgically repairing insecure code generation in LLMs is out
- September 2025: Released our study on early adversarial fine-tuning for prompt injection defense
- August 2023: Presented “Lost at C” at USENIX Security ’23
Recent Blog Posts
Mechanistic Interpretability as a Security Tool
Date: June 22, 2026
Estimated Reading Time: 10 min
Author: Gustavo Sandoval
Mechanistic interpretability usually gets pitched as a science project, a way to understand in principle what’s happening inside a neural network. I want to make a narrower, more practical case for it:...

Prompt Injection, 2022 vs Today: A Retrospective
Date: June 20, 2026
Estimated Reading Time: 11 min
Author: Gustavo Sandoval
People like to call prompt injection the SQL injection of LLMs, and the comparison holds up better than most. Untrusted input and trusted instructions travel down the same channel, and the...

The Format-Reliability Gap: Diagnosing and Repairing Insecure Code Generation
Date: June 18, 2026
Estimated Reading Time: 10 min
Author: Gustavo Sandoval
Ask an LLM to generate a piece of code and it might hand you a SQL injection or a buffer overflow. Ask the same model, a minute later, “does...