AI Securities Blog

LLM Output Verification: Detecting Hallucinations and Injection in Production

One of the most overlooked attack surfaces in production LLM deployments is the output channel. Most teams invest heavily in input sanitization — filtering prompts for injection payloads, blocking encoded instructions, rate-limiting API calls — but treat the model’s response as a trusted artifact.

It’s not.

Why Output Matters

An LLM can produce harmful output in three scenarios:

  1. Successful prompt injection — an attacker’s instruction slips through input filters, and the model complies
  2. Hallucination with consequences — the model fabricates an API endpoint, command, or data point that an automated system acts on
  3. Data exfiltration via response — the model includes sensitive training data or user PII in its response because the prompt was designed to extract it

Each of these is a failure mode that input sanitization alone cannot prevent. Once the model has processed the instruction, inspecting the output is the last line of defense.

What Output Validation Looks Like

Output validation is a middleware layer that sits between the LLM and the user or downstream system. It inspects the model’s response before delivering it, applying the same kinds of rules as input sanitization to what leaves the model rather than what enters it.

Structural Validation

Define expected response schemas for each use case. If an LLM is supposed to return a JSON blob with three fields, any response that deviates — extra fields, missing fields, or non-JSON content — gets rejected.

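import json

EXPECTED_KEYS = {"status", "result", "confidence"}
FALLBACK = {"error": "Response validation failed", "fallback": "I couldn't process that request."}

def validated_response(llm, prompt):
    """Return the parsed response, or a safe fallback if it fails validation."""
    response = llm.generate(prompt)
    try:
        data = json.loads(response)
        # Require a JSON object with exactly the expected keys: no extras, none missing.
        if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
            raise ValueError("Response schema mismatch")
    except (json.JSONDecodeError, ValueError):
        return FALLBACK
    return data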

Content Policy Enforcement

Scan the output for:
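
A minimal sketch of such a scan, assuming three illustrative categories: email addresses, strings shaped like API keys, and echoed injection phrases. The categories, the regular expressions, and the function names are assumptions made for illustration, not a definitive policy; a real deployment would tune them to its own data.

```python
import re

# Hypothetical patterns, chosen for illustration only; tune them to your own policy.
POLICY_PATTERNS = {
    "email_address": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "api_key_like": re.compile(r"\b(?:sk|pk|key)[-_][A-Za-z0-9]{16,}\b"),
    "instruction_echo": re.compile(r"ignore (?:all )?previous instructions", re.IGNORECASE),
}


def scan_output(text: str) -> list[str]:
    """Return the names of every policy pattern that matches the model output."""
    return [name for name, pattern in POLICY_PATTERNS.items() if pattern.search(text)]


def enforce_policy(text: str) -> str:
    """Block the whole response if any pattern matches; otherwise pass it through."""
    if scan_output(text):
        return "I couldn't process that request."
    return text
```

Returning the matched pattern names from `scan_output` keeps the reason for a block available for logging, while `enforce_policy` rejects the whole response rather than redacting part of it.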

Consistency Checks

For factual queries, compare the LLM’s output against a trusted source or knowledge base. If the model claims “the API rate limit is 10,000 requests per second” but your documented limit is 1,000, the response should be flagged.
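
The rate-limit example can be sketched as a small check. This assumes the trusted value lives in a constant (here the hypothetical `DOCUMENTED_RATE_LIMIT_PER_SECOND`) and that claims are phrased as "N requests per second"; real responses would need broader extraction than a single regular expression.

```python
import re

# Hypothetical trusted value; in practice this would come from your documentation or config.
DOCUMENTED_RATE_LIMIT_PER_SECOND = 1_000

# Matches claims such as "10,000 requests per second" (an assumed phrasing).
RATE_CLAIM = re.compile(r"(\d[\d,]*)\s+requests per second", re.IGNORECASE)


def check_rate_limit_claim(response: str) -> bool:
    """Return True if every rate-limit claim in the response matches the documented value."""
    for match in RATE_CLAIM.finditer(response):
        claimed = int(match.group(1).replace(",", ""))
        if claimed != DOCUMENTED_RATE_LIMIT_PER_SECOND:
            return False  # the model contradicts the trusted source: flag it
    return True  # no contradicting claim found
```

A response with no recognizable claim passes this check, so it only catches contradictions it can parse. It does not confirm that the rest of the response is accurate.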

Real-World Implementation

The middleware that caught the prompt injection attack described in our previous post combined three layers:

  1. Input sanitization stripped Base64 and encoded payloads
  2. Output validation caught the non-conforming response format
  3. A fallback handler returned a safe generic message

The key insight: the output validator didn’t need to understand the attack. It only needed to know what a legitimate response looked like. Deviation alone was enough to trigger the block.
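
The three layers can be sketched end to end as follows. This is an illustration under stated assumptions, not the middleware from the previous post: the Base64 heuristic, the function names, and the `generate` callable standing in for the model call are all hypothetical.

```python
import json
import re
from typing import Callable, Optional

FALLBACK = {"error": "Response validation failed", "fallback": "I couldn't process that request."}
EXPECTED_KEYS = {"status", "result", "confidence"}

# Rough heuristic for Base64-looking runs; an assumption, not the filter from the post.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def sanitize_input(prompt: str) -> str:
    """Layer 1: strip long Base64-looking runs from the prompt."""
    return BASE64_RUN.sub("", prompt)


def validate_output(response: str) -> Optional[dict]:
    """Layer 2: accept only a JSON object with exactly the expected keys."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and set(data) == EXPECTED_KEYS:
        return data
    return None


def handle_request(prompt: str, generate: Callable[[str], str]) -> dict:
    """Run all three layers; `generate` is a stand-in for the real model call."""
    data = validate_output(generate(sanitize_input(prompt)))
    return data if data is not None else FALLBACK  # Layer 3: safe fallback
```

Note that `validate_output` knows nothing about Base64 or injection. It only knows the legitimate response shape, which is why a payload that slips past layer 1 can still be stopped at layer 2.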

Building Output Validation Into Your Pipeline

Start with three rules:

  1. Schema-first — define the shape of every expected response
  2. Reject, don’t sanitize — if output fails validation, block it entirely. Partial sanitization of model output creates a false sense of security
  3. Log everything — every validation failure is a signal. Track patterns over time to identify evolving threats
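
The third rule can be sketched with the standard `logging` module and a per-reason tally. The logger name, the `record_failure` helper, and the in-memory counter are assumptions for illustration; a production system would more likely send these events to a metrics or log-aggregation service.

```python
import logging
from collections import Counter

logger = logging.getLogger("llm_output_validation")  # hypothetical logger name
failure_counts = Counter()  # in-memory tally, kept simple for the sketch


def record_failure(reason: str, response: str) -> None:
    """Log a validation failure and tally it by reason."""
    failure_counts[reason] += 1
    # This sketch logs the response length, not its text, since blocked output
    # may itself contain sensitive data; adjust to your own retention policy.
    logger.warning("output rejected: reason=%s response_chars=%d", reason, len(response))
```

Calling `failure_counts.most_common()` then gives a quick view of which validation rule fires most often, a starting point for spotting patterns over time.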

Output validation won’t catch everything — sophisticated attacks can produce responses that pass structural checks but still contain malicious content. But it raises the bar significantly and catches the most common failure modes.

Treat your LLM’s output like you treat untrusted user input. Because that’s exactly what it is.