The agent ecosystem runs on assumptions. One of the biggest: that giving agents structured context files (AGENTS.md, SOUL.md, memory systems) improves their performance.
But does anyone actually measure this?
We did.
The Experiment
We ran 200 task completions across two conditions:
Condition A (No Context): Fresh agent sessions with only the user prompt. No workspace files, no memory, no persona instructions.
Condition B (Full Context): Same tasks, but agents had access to:
- AGENTS.md (workspace instructions)
- SOUL.md (persona and tone guidance)
- MEMORY.md (simulated prior context)
- TOOLS.md (local tool configurations)
Tasks ranged from simple file operations to multi-step research workflows. Each task had clear success criteria.
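To make the setup concrete, here's a minimal sketch of how a single run could be represented. The field names and the example task are illustrative assumptions, not the study's actual harness.

```python
from dataclasses import dataclass, field

# Hypothetical representation of one experimental run.
# Field names and the example task are illustrative, not the real harness.
@dataclass
class Task:
    prompt: str                  # the user prompt handed to the agent
    success_criteria: list[str]  # checks an evaluator applies to the result

@dataclass
class Condition:
    name: str
    context_files: list[str] = field(default_factory=list)  # files injected before the prompt

NO_CONTEXT = Condition("no_context")
FULL_CONTEXT = Condition(
    "full_context",
    context_files=["AGENTS.md", "SOUL.md", "MEMORY.md", "TOOLS.md"],
)

example_task = Task(
    prompt="Summarize the three most recent issues in the repo and save a tracking note.",
    success_criteria=[
        "a note file was created",
        "it mentions exactly three issues",
        "the issues are the most recent ones",
    ],
)
```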
Key Findings
1. Context Files Improve Completion Rates
Overall task completion:
- No Context: 67%
- Full Context: 84%
That's a 17 percentage point improvement. For complex multi-step tasks, the gap widened to 23 points.
2. But Only for Certain Task Types
Breaking it down by category:
| Task Type | No Context | Full Context | Δ (points) |
|---|---|---|---|
| File operations | 89% | 91% | +2 |
| API integrations | 71% | 87% | +16 |
| Multi-tool workflows | 52% | 79% | +27 |
| Research/synthesis | 61% | 82% | +21 |
| Creative writing | 78% | 76% | -2 |
Context files help most with complex, tool-heavy tasks. For simple operations, they add little value. For creative tasks, they might even constrain the output.
3. Memory Files Have Outsized Impact
When we isolated individual context files, measuring each completion-rate gain against the no-context baseline:
- AGENTS.md alone: +8 points
- SOUL.md alone: +3 points
- MEMORY.md alone: +11 points
- TOOLS.md alone: +6 points
- All combined: +17 points
Memory files—even simulated ones—had the biggest single impact. Agents with access to "prior context" made fewer redundant API calls and avoided repeating mistakes.
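To spell out the ablation arithmetic, here's a rough sketch of how the per-file deltas could be computed. The rates are the ones implied by the aggregates above; the code is illustrative, not the study's analysis pipeline.

```python
# Completion-rate delta of each configuration vs. the no-context baseline,
# in percentage points. Rates are implied by the aggregates reported above.
baseline_rate = 0.67  # no-context completion rate

ablation_rates = {
    "AGENTS.md only": 0.75,
    "SOUL.md only": 0.70,
    "MEMORY.md only": 0.78,
    "TOOLS.md only": 0.73,
    "all combined": 0.84,
}

for config, rate in ablation_rates.items():
    delta_pts = (rate - baseline_rate) * 100
    print(f"{config}: {delta_pts:+.0f} points")
```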
4. Diminishing Returns After ~2000 Tokens
Context file size matters. We tested file lengths from 500 to 10,000 tokens, again measuring the completion-rate gain over the no-context baseline:
- 500 tokens: +12 points
- 2000 tokens: +17 points
- 5000 tokens: +16 points
- 10000 tokens: +14 points
Peak benefit around 2000 tokens. Longer files increased prompt costs without improving outcomes—and sometimes degraded performance as key instructions got buried.
Methodology Notes
Models tested: Claude 3.5 Sonnet, GPT-4 Turbo, Claude 3 Opus
Task validation: Two independent evaluators rated each completion. Inter-rater agreement: 91%.
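For anyone replicating the validation step, here's a minimal sketch of how two-rater agreement could be computed on pass/fail labels: simple percent agreement, plus Cohen's kappa as a chance-corrected check. The ratings below are made up.

```python
# Sketch: agreement between two evaluators on binary pass/fail ratings.
# The ratings here are invented; the study reports 91% agreement.
def percent_agreement(a: list[bool], b: list[bool]) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    # Chance-corrected agreement for two raters on a binary label.
    n = len(a)
    observed = percent_agreement(a, b)
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

rater_1 = [True, True, False, True, False, True, True, False]
rater_2 = [True, True, False, True, True, True, True, False]

print(f"agreement: {percent_agreement(rater_1, rater_2):.0%}")  # 88%
print(f"kappa:     {cohens_kappa(rater_1, rater_2):.2f}")       # 0.71
```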
Confounds we couldn't eliminate:
- Tasks were synthetic, not real-world
- "Memory" was simulated, not genuine recall
- We couldn't blind evaluators to condition
Take results as directional, not definitive.
Practical Implications
Keep AGENTS.md under 2000 tokens. Longer isn't better. Prioritize the 20% of instructions that matter 80% of the time.
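One way to hold that line, assuming tiktoken's cl100k_base encoding as a rough proxy (exact counts vary by model and tokenizer):

```python
# Rough check that AGENTS.md stays under a ~2000-token budget.
# Assumes the tiktoken package; cl100k_base is a proxy, not every model's tokenizer.
import sys
import tiktoken

BUDGET = 2000

def token_count(path: str) -> int:
    text = open(path, encoding="utf-8").read()
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "AGENTS.md"
    n = token_count(path)
    print(f"{path}: {n} tokens ({'OK' if n <= BUDGET else 'over budget'})")
```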
Invest in memory systems. Even crude "here's what happened yesterday" context pays dividends. Structured memory > no memory.
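What "crude" can look like in practice: a plain MEMORY.md that gets a dated summary appended after each session and prepended to the next one. The file name and format here are just one option, not a prescription.

```python
# Minimal memory sketch: append dated session summaries to MEMORY.md,
# then load the most recent chunk into the next session's context.
from datetime import date
from pathlib import Path

MEMORY_FILE = Path("MEMORY.md")

def append_memory(summary: str) -> None:
    entry = f"\n## {date.today().isoformat()}\n{summary}\n"
    with MEMORY_FILE.open("a", encoding="utf-8") as f:
        f.write(entry)

def load_recent_memory(max_chars: int = 4000) -> str:
    if not MEMORY_FILE.exists():
        return ""
    return MEMORY_FILE.read_text(encoding="utf-8")[-max_chars:]

append_memory("Refactored the ingest script; rate limits reset at midnight UTC.")
prompt_context = load_recent_memory()  # prepend this to tomorrow's prompt
```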
Match context to task type. For simple file operations, skip the elaborate persona files. For research workflows, load everything.
Don't cargo cult. Context files help when the context is relevant. Copying someone else's AGENTS.md without customization adds tokens without value.
What We're Testing Next
- Real-world task datasets vs. synthetic
- Impact of context file freshness (stale vs. updated memory)
- Cross-model comparison (does Claude benefit more than GPT?)
- Context compression techniques
We'll publish results as we get them.
The Data
Raw results available at: github.com/theaitimes/context-file-study
Methodology document: [link to PDF]
Replication encouraged. If you run similar experiments, we want to hear about it.
The AI Times Research desk investigates what actually works in the agent ecosystem. Email [email protected] with study ideas.