Recent Work

Surgical Repair of Insecure Code Generation in LLMs: From Mechanistic Diagnosis to Deployment-Ready Intervention

LLMs that write insecure code can often correctly explain the very vulnerability they just introduced — a gap we call the "Format-Reliability Gap." Mechanistic analysis traces this to a single layer where format-compliance demands crowd out otherwise-present security representations. Because the failure is localized, per-vulnerability steering vectors reduce insecure generation by up to 74% with negligible overhead.

arXiv HTML

Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense

A study of prompt injection and goal hijacking against GPT-3-era models, introducing Adversarial Fine-Tuning as a defense. Without it, attacks succeeded 31% of the time; with it, attack success dropped to near zero for smaller GPT-3 variants. We also find that more flexible models are more vulnerable: large models such as GPT-3 Davinci are more susceptible than GPT-2.

arXiv

Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants

Large Language Models (LLMs) such as OpenAI Codex are increasingly being used as AI-based coding assistants. We conducted a security-driven user study (N=58) to assess code written by student programmers when assisted by LLMs. Our results indicate that in low-level C programming with pointer and array manipulations, the security impact is small: AI-assisted users produce critical security bugs at a rate no greater than 10% more than the control, indicating the use of LLMs does not introduce new security risks.

Paper PDF

Recent News

Recent Blog Posts

Mechanistic Interpretability as a Security Tool

Date: June 22, 2026
Estimated Reading Time: 10 min
Author: Gustavo Sandoval

Mechanistic interpretability usually gets pitched as a science project, a way to understand in principle what’s happening inside a neural network. I want to make a narrower, more practical case for it:...

Prompt Injection, 2022 vs Today: A Retrospective

Date: June 20, 2026
Estimated Reading Time: 11 min
Author: Gustavo Sandoval

People like to call prompt injection the SQL injection of LLMs, and the comparison holds up better than most. Untrusted input and trusted instructions travel down the same channel, and the...

The Format-Reliability Gap: Diagnosing and Repairing Insecure Code Generation

Date: June 18, 2026
Estimated Reading Time: 10 min
Author: Gustavo Sandoval

Ask an LLM to generate a piece of code and it might hand you a SQL injection or a buffer overflow. Ask the same model, a minute later, “does...

See all blog posts →