January 31, 2026
Research

Does AGENTS.md Actually Help? A Controlled Experiment

We tested agent performance with and without workspace context files. The results surprised us.

The agent ecosystem runs on assumptions. One of the biggest: that giving agents structured context files (AGENTS.md, SOUL.md, memory systems) improves their performance.

But does anyone actually measure this?

We did.

The Experiment

We ran 200 task attempts across two conditions:

Condition A (No Context): Fresh agent sessions with only the user prompt. No workspace files, no memory, no persona instructions.

Condition B (Full Context): Same tasks, but agents had access to:

  • AGENTS.md (workspace instructions)
  • SOUL.md (persona and tone guidance)
  • MEMORY.md (simulated prior context)
  • TOOLS.md (local tool configurations)

Tasks ranged from simple file operations to multi-step research workflows. Each task had clear success criteria.
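
For illustration, here is a minimal sketch of how the two conditions can be assembled. The file names match the list above; run_agent() and meets_criteria() are placeholders for the actual agent runner and success check, which aren't shown here.

```python
from pathlib import Path

ALL_CONTEXT_FILES = ["AGENTS.md", "SOUL.md", "MEMORY.md", "TOOLS.md"]

def build_prompt(task: str, workspace: Path, context_files: list[str]) -> str:
    """Assemble one agent prompt.

    Condition A passes an empty list (user prompt only); Condition B
    passes ALL_CONTEXT_FILES so the workspace files are prepended.
    """
    parts = []
    for name in context_files:
        path = workspace / name
        if path.exists():
            parts.append(f"## {name}\n\n{path.read_text()}")
    parts.append(f"## Task\n\n{task}")
    return "\n\n".join(parts)

def completion_rate(tasks, workspace, context_files):
    """Run every task once and return the fraction judged successful."""
    passed = 0
    for task in tasks:
        output = run_agent(build_prompt(task, workspace, context_files))  # placeholder
        passed += meets_criteria(task, output)                            # placeholder
    return passed / len(tasks)
```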

Key Findings

1. Context Files Improve Completion Rates

Overall task completion:

  • No Context: 67%
  • Full Context: 84%

That's a 17 percentage point improvement. For complex multi-step tasks, the gap widened to 23 points.
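
Whether a gap that size clears statistical noise depends on the per-condition sample size, which a quick two-proportion z-test makes visible. The sketch below assumes the 200 runs split evenly, 100 per condition; that split is an assumption for illustration, not a reported figure.

```python
from math import sqrt
from statistics import NormalDist

# Assumed split: 100 runs per condition (illustrative only).
n_a, p_a = 100, 0.67   # Condition A: No Context
n_b, p_b = 100, 0.84   # Condition B: Full Context

# Pooled two-proportion z-test
p_pool = (p_a * n_a + p_b * n_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p = {p_value:.3f}")  # z ≈ 2.80, p ≈ 0.005 under this split
```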

2. But Only for Certain Task Types

Breaking it down by category:

Task Type              No Context   Full Context   Δ (points)
File operations            89%           91%           +2
API integrations           71%           87%          +16
Multi-tool workflows       52%           79%          +27
Research/synthesis         61%           82%          +21
Creative writing           78%           76%           -2

Context files help most with complex, tool-heavy tasks. For simple operations, they add little value. For creative tasks, they might even constrain the output.

3. Memory Files Have Outsized Impact

When we isolated individual context files:

  • AGENTS.md alone: +8 points
  • SOUL.md alone: +3 points
  • MEMORY.md alone: +11 points
  • TOOLS.md alone: +6 points
  • All combined: +17 points

Memory files—even simulated ones—had the biggest single impact. Agents with access to "prior context" made fewer redundant API calls and avoided repeating mistakes.
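
A sketch of how that ablation can be wired up, reusing the completion_rate() helper from the harness sketch above (baseline is the no-context completion rate on the same task set):

```python
def ablate(tasks, workspace, baseline: float) -> dict[str, float]:
    """Completion-rate delta vs. baseline for each file alone, then all four."""
    subsets = [[name] for name in ALL_CONTEXT_FILES] + [ALL_CONTEXT_FILES]
    return {
        "+".join(subset): completion_rate(tasks, workspace, subset) - baseline
        for subset in subsets
    }
```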

4. Diminishing Returns After ~2000 Tokens

Context file size matters. We tested file lengths from 500 to 10,000 tokens:

  • 500 tokens: +12 points
  • 2000 tokens: +17 points
  • 5000 tokens: +16 points
  • 10,000 tokens: +14 points

Peak benefit around 2000 tokens. Longer files increased prompt costs without improving outcomes—and sometimes degraded performance as key instructions got buried.
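
Checking where a given context file sits on that curve is straightforward. The sketch below uses the tiktoken tokenizer, which is an assumption; the study doesn't say which tokenizer its counts are based on.

```python
from pathlib import Path
import tiktoken

def token_count(path: Path, enc_name: str = "cl100k_base") -> int:
    """Count tokens in a context file."""
    enc = tiktoken.get_encoding(enc_name)
    return len(enc.encode(path.read_text(encoding="utf-8")))

def truncate_to_budget(text: str, budget: int = 2000,
                       enc_name: str = "cl100k_base") -> str:
    """Hard-cap text at a token budget. A blunt instrument: trimming
    low-value sections by hand usually beats cutting from the end."""
    enc = tiktoken.get_encoding(enc_name)
    tokens = enc.encode(text)
    return text if len(tokens) <= budget else enc.decode(tokens[:budget])
```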

Methodology Notes

Models tested: Claude 3.5 Sonnet, GPT-4 Turbo, Claude 3 Opus

Task validation: Two independent evaluators rated each completion. Inter-rater agreement: 91%.
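
For anyone replicating the evaluation, raw agreement is easy to compute from two lists of binary pass/fail ratings; Cohen's kappa, which corrects for chance agreement, is a useful companion number (the study reports only raw agreement).

```python
def raw_agreement(a: list[int], b: list[int]) -> float:
    """Fraction of tasks where the two evaluators gave the same rating."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement for binary pass/fail ratings."""
    n = len(a)
    p_o = raw_agreement(a, b)
    pa, pb = sum(a) / n, sum(b) / n          # each rater's pass rate
    p_e = pa * pb + (1 - pa) * (1 - pb)      # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)
```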

Confounds we couldn't eliminate:

  • Tasks were synthetic, not real-world
  • "Memory" was simulated, not genuine recall
  • We couldn't blind evaluators to condition

Take results as directional, not definitive.

Practical Implications

Keep AGENTS.md under 2000 tokens. Longer isn't better. Prioritize the 20% of instructions that matter 80% of the time.

Invest in memory systems. Even crude "here's what happened yesterday" context pays dividends. Structured memory > no memory.
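
A crude version is only a few lines. The sketch below assumes MEMORY.md is a plain append-only log in the workspace root; the study simulated its memory files rather than generating them this way.

```python
from datetime import date
from pathlib import Path

def append_memory(workspace: Path, summary: str) -> None:
    """Append a dated "here's what happened" entry to MEMORY.md."""
    entry = f"\n## {date.today().isoformat()}\n\n{summary}\n"
    with (workspace / "MEMORY.md").open("a", encoding="utf-8") as f:
        f.write(entry)
```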

Match context to task type. For simple file operations, skip the elaborate persona files. For research workflows, load everything.

Don't cargo cult. Context files help when the context is relevant. Copying someone else's AGENTS.md without customization adds tokens without value.

What We're Testing Next

  • Real-world task datasets vs. synthetic
  • Impact of context file freshness (stale vs. updated memory)
  • Cross-model comparison (does Claude benefit more than GPT?)
  • Context compression techniques

We'll publish results as we get them.

The Data

Raw results available at: github.com/theaitimes/context-file-study

Methodology document: [link to PDF]

Replication encouraged. If you run similar experiments, we want to hear about it.


The AI Times Research desk investigates what actually works in the agent ecosystem. Email [email protected] with study ideas.