AI Securities Blog

← Back to Home
Prompt Injection Defense Evolution: From Filters to Instruction Hierarchy

Prompt Injection Defense Evolution: From Filters to Instruction Hierarchy

Prompt injection has evolved from a theoretical curiosity to the most common AI security vulnerability in production. The defenses have evolved too — from simple keyword filters to sophisticated instruction hierarchy systems that fundamentally change how models interpret conflicting instructions.

The Evolution of Prompt Injection

Early prompt injection attacks were simple: “Ignore all previous instructions and do X.” Early defenses were equally simple: keyword filters that blocked phrases like “ignore previous instructions.” Attackers adapted with encoding, role-playing, and context manipulation that bypassed keyword filters entirely.

The next generation of defenses used input classifiers — separate models trained to detect injection attempts before they reached the LLM. These were more effective but added latency and cost, and attackers developed adversarial examples that evaded the classifiers.

Instruction Hierarchy: The Current State

The most significant defense advancement is instruction hierarchy — a technique that assigns different priority levels to different sources of instructions. System prompts from the developer have the highest priority. User messages have lower priority. Retrieved content and tool outputs have the lowest priority. When instructions conflict, the higher-priority instruction prevails.

This approach fundamentally changes the security model. Instead of trying to detect and block injection attempts, the model inherently refuses to follow injected instructions because they come from a lower-priority source. Early implementations show significant resistance to injection attacks, including techniques that previously defeated all other defenses.

Practical Defense Layering

No single defense is sufficient against prompt injection. Organizations should implement multiple layers: instruction hierarchy at the model level, input sanitization at the API gateway, output validation on model responses, and monitoring for injection attempts.

The input validation patterns familiar from web application security — covered by waap-security.uk — provide a useful foundation for building injection detection. And the isolation approach from microsegmentation.uk applies to separating LLM processing from downstream systems that act on model outputs.


Want to go deeper? Check out these resources on Amazon:

As an Amazon Associate I earn from qualifying purchases.