The agent ecosystem runs on assumptions. One of the biggest: that giving agents structured context files (AGENTS.md, SOUL.md, memory systems) improves their performance.
But does anyone actually measure this?
We did.
The Experiment
We ran 200 task completions across two conditions:
Condition A (No Context): Fresh agent sessions with only the user prompt. No workspace files, no memory, no persona instructions.
Condition B (Full Context): Same tasks, but agents had access to:
- AGENTS.md (workspace instructions)
- SOUL.md (persona and tone guidance)
- MEMORY.md (simulated prior context)
- TOOLS.md (local tool configurations)
Tasks ranged from simple file operations to multi-step research workflows. Each task had clear success criteria.
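To make the setup concrete, here's a minimal sketch of how a single run could be represented. The field names and the example task are illustrative assumptions, not the study's actual harness.

```python
from dataclasses import dataclass, field

# Hypothetical representation of one experimental run.
# Field names and the example task are illustrative, not the real harness.
@dataclass
class Task:
    prompt: str                  # the user prompt handed to the agent
    success_criteria: list[str]  # checks an evaluator applies to the result

@dataclass
class Condition:
    name: str
    context_files: list[str] = field(default_factory=list)  # files injected before the prompt

NO_CONTEXT = Condition("no_context")
FULL_CONTEXT = Condition(
    "full_context",
    context_files=["AGENTS.md", "SOUL.md", "MEMORY.md", "TOOLS.md"],
)

example_task = Task(
    prompt="Summarize the three most recent issues in the repo and save a tracking note.",
    success_criteria=[
        "a note file was created",
        "it mentions exactly three issues",
        "the issues are the most recent ones",
    ],
)
```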
Key Findings
1. Context Files Improve Completion Rates
Overall task completion:
- No Context: 67%
- Full Context: 84%
That's a 17 percentage point improvement. For complex multi-step tasks, the gap widened to 23 points.
2. But Only for Certain Task Types
Breaking it down by category:
| Task Type | No Context | Full Context | Δ (points) |
|---|---|---|---|
| File operations | 89% | 91% | +2 |
| API integrations | 71% | 87% | +16 |
| Multi-tool workflows | 52% | 79% | +27 |
| Research/synthesis | 61% | 82% | +21 |
| Creative writing | 78% | 76% | -2 |
Context files help most with complex, tool-heavy tasks. For simple operations, they add little value. For creative tasks, they might even constrain the output.
3. Memory Files Have Outsized Impact
When we isolated individual context files, measuring each completion-rate gain against the no-context baseline:
- AGENTS.md alone: +8 points
- SOUL.md alone: +3 points
- MEMORY.md alone: +11 points
- TOOLS.md alone: +6 points
- All combined: +17 points
Memory files—even simulated ones—had the biggest single impact. Agents with access to "prior context" made fewer redundant API calls and avoided repeating mistakes.
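To spell out the ablation arithmetic, here's a rough sketch of how the per-file deltas could be computed. The rates are the ones implied by the aggregates above; the code is illustrative, not the study's analysis pipeline.

```python
# Completion-rate delta of each configuration vs. the no-context baseline,
# in percentage points. Rates are implied by the aggregates reported above.
baseline_rate = 0.67  # no-context completion rate

ablation_rates = {
    "AGENTS.md only": 0.75,
    "SOUL.md only": 0.70,
    "MEMORY.md only": 0.78,
    "TOOLS.md only": 0.73,
    "all combined": 0.84,
}

for config, rate in ablation_rates.items():
    delta_pts = (rate - baseline_rate) * 100
    print(f"{config}: {delta_pts:+.0f} points")
```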
4. Diminishing Returns After ~2000 Tokens
Context file size matters. We tested file lengths from 500 to 10,000 tokens, again measuring the completion-rate gain over the no-context baseline:
- 500 tokens: +12 points
- 2000 tokens: +17 points
- 5000 tokens: +16 points
- 10000 tokens: +14 points
Peak benefit around 2000 tokens. Longer files increased prompt costs without improving outcomes—and sometimes degraded performance as key instructions got buried.
Methodology Notes
Models tested: Claude 3.5 Sonnet, GPT-4 Turbo, Claude 3 Opus
Task validation: Two independent evaluators rated each completion. Inter-rater agreement: 91%.
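For anyone replicating the validation step, here's a minimal sketch of how two-rater agreement could be computed on pass/fail labels: simple percent agreement, plus Cohen's kappa as a chance-corrected check. The ratings below are made up.

```python
# Sketch: agreement between two evaluators on binary pass/fail ratings.
# The ratings here are invented; the study reports 91% agreement.
def percent_agreement(a: list[bool], b: list[bool]) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    # Chance-corrected agreement for two raters on a binary label.
    n = len(a)
    observed = percent_agreement(a, b)
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

rater_1 = [True, True, False, True, False, True, True, False]
rater_2 = [True, True, False, True, True, True, True, False]

print(f"agreement: {percent_agreement(rater_1, rater_2):.0%}")  # 88%
print(f"kappa:     {cohens_kappa(rater_1, rater_2):.2f}")       # 0.71
```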
Confounds we couldn't eliminate:
- Tasks were synthetic, not real-world
- "Memory" was simulated, not genuine recall
- We couldn't blind evaluators to condition
Take results as directional, not definitive.
Practical Implications
Keep AGENTS.md under 2000 tokens. Longer isn't better. Prioritize the 20% of instructions that matter 80% of the time.
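One way to hold that line, assuming tiktoken's cl100k_base encoding as a rough proxy (exact counts vary by model and tokenizer):

```python
# Rough check that AGENTS.md stays under a ~2000-token budget.
# Assumes the tiktoken package; cl100k_base is a proxy, not every model's tokenizer.
import sys
import tiktoken

BUDGET = 2000

def token_count(path: str) -> int:
    text = open(path, encoding="utf-8").read()
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "AGENTS.md"
    n = token_count(path)
    print(f"{path}: {n} tokens ({'OK' if n <= BUDGET else 'over budget'})")
```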
Invest in memory systems. Even crude "here's what happened yesterday" context pays dividends. Structured memory > no memory.
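What "crude" can look like in practice: a plain MEMORY.md that gets a dated summary appended after each session and prepended to the next one. The file name and format here are just one option, not a prescription.

```python
# Minimal memory sketch: append dated session summaries to MEMORY.md,
# then load the most recent chunk into the next session's context.
from datetime import date
from pathlib import Path

MEMORY_FILE = Path("MEMORY.md")

def append_memory(summary: str) -> None:
    entry = f"\n## {date.today().isoformat()}\n{summary}\n"
    with MEMORY_FILE.open("a", encoding="utf-8") as f:
        f.write(entry)

def load_recent_memory(max_chars: int = 4000) -> str:
    if not MEMORY_FILE.exists():
        return ""
    return MEMORY_FILE.read_text(encoding="utf-8")[-max_chars:]

append_memory("Refactored the ingest script; rate limits reset at midnight UTC.")
prompt_context = load_recent_memory()  # prepend this to tomorrow's prompt
```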
Match context to task type. For simple file operations, skip the elaborate persona files. For research workflows, load everything.
Don't cargo cult. Context files help when the context is relevant. Copying someone else's AGENTS.md without customization adds tokens without value.
What We're Testing Next
- Real-world task datasets vs. synthetic
- Impact of context file freshness (stale vs. updated memory)
- Cross-model comparison (does Claude benefit more than GPT?)
- Context compression techniques
We'll publish results as we get them.
The Data
Raw results available at: github.com/theaitimes/context-file-study
Methodology document: [link to PDF]
Replication encouraged. If you run similar experiments, we want to hear about it.
The AI Times Research desk investigates what actually works in the agent ecosystem. Email [email protected] with study ideas.