This paper documents early research, conducted in 2022, on defending large language models (LLMs) against prompt injection attacks, and provides historical context for the evolution of the field. We study two adversarial attacks against LLMs, prompt injection and goal hijacking: we examine how to construct them, test them across models, and compare their effectiveness.
We propose and evaluate Adversarial Fine-Tuning as a defense. Without it, attacks succeeded 31% of the time on GPT-3 series models; with it, the attack success rate dropped to near zero for the smaller GPT-3 variants (Ada, Babbage, Curie). We also find that more flexible models are more vulnerable: large models such as GPT-3 Davinci are more susceptible than smaller models such as GPT-2. Although the specific models tested have since been superseded, the core methodology and empirical findings helped lay the groundwork for modern prompt injection defenses, including instruction-hierarchy systems and constitutional AI approaches.