Recent Blog Posts

2026-07-29 Red-Teaming Language Models

The process of red-teaming a language model is close to solved; the measurement is not. A map of the converged six-phase process, how attacks are actually built, why the numbers are untrustworthy, and how everything reorders when you ship weights.

2026-06-22 Mechanistic Interpretability as a Security Tool

Interpretability is usually pitched as a science project. The narrower case: if you can find where a model decides to do something unsafe, you can often fix it there instead of retraining around it.

2026-06-22 AI Literacy Audit and Rollout Plan

How to introduce agentic coding to a medium-size team, built on a Netflix training study, with a security reviewer's caveats about the productivity numbers everyone quotes.

2026-06-20 Prompt Injection, 2022 vs Today: A Retrospective

Prompt injection gets called the SQL injection of LLMs, and the comparison holds up. What the attack looked like against GPT-3 in 2022, what changed once models got tools, and what did not change at all.

Recent Work

2026-04 Surgical Repair of Insecure Code Generation in LLMs: From Mechanistic Diagnosis to Deployment-Ready Intervention

LLMs that write insecure code can often correctly explain the very vulnerability they just introduced — a gap we call the "Format-Reliability Gap." Mechanistic analysis traces this to a single layer where format-compliance demands crowd out otherwise-present security representations. Because the failure is localized, per-vulnerability steering vectors reduce insecure generation by up to 74% with negligible overhead.

arXiv HTML

2025-09 Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense

A study of prompt injection and goal hijacking against GPT-3-era models, introducing Adversarial Fine-Tuning as a defense. Without it, attacks succeeded 31% of the time; with it, attack success dropped to near zero for smaller GPT-3 variants. We also find more flexible models are more vulnerable — large models like GPT-3 Davinci more so than GPT-2.

arXiv

2023-08 Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants

Large Language Models (LLMs) such as OpenAI Codex are increasingly being used as AI-based coding assistants. We conducted a security-driven user study (N=58) to assess code written by student programmers when assisted by LLMs. Our results indicate that in low-level C programming with pointer and array manipulations, the security impact is small: AI-assisted users produce critical security bugs at a rate no greater than 10% more than the control, indicating the use of LLMs does not introduce new security risks.

Paper PDF

gussand
