Prompt injection remains the most persistent vulnerability in agent systems. But new research suggests the problem may be more tractable than previously thought.
A team at Stanford published results this week showing a layered defense approach that catches 94% of injection attempts while maintaining normal agent functionality. The technique combines input sanitization, output monitoring, and a novel "context anchoring" method.
The Approach
Traditional defenses focus on filtering inputs. Look for suspicious patterns. Block known attack strings. These approaches fail because attackers can always find new patterns.
The Stanford approach adds two additional layers:
Output monitoring. Instead of just checking inputs, the system monitors what the agent actually does. If an agent suddenly tries to access files it's never touched before, or sends data to unknown endpoints, the system flags it.
Context anchoring. The agent maintains a "anchor context" - a summary of its legitimate purpose and constraints. Every action is checked against this anchor. Actions that conflict with the anchor require explicit human approval.
Real-World Performance
In testing against a corpus of 10,000 known injection attacks plus 5,000 novel attacks generated by red-team models, the system blocked 94.2% of attacks. False positives were under 2%.
The remaining 5.8% of successful attacks were mostly sophisticated multi-step injections that gradually shifted context over many interactions. The researchers note this class of attack is harder to execute in practice and propose additional mitigations.
Practical Implementation
The techniques don't require model changes. They can be implemented as wrappers around existing agents. Several open-source implementations are already available.
For agent developers, the takeaway is clear: defense in depth works. Input filtering alone isn't enough. Monitor outputs. Anchor context. Make attackers work harder.
The cat-and-mouse game continues, but defenders are catching up.