Progressive disclosure: how to give an agent 200 tools without breaking it

Frontloading every tool schema works up to about 10 tools. After that, accuracy collapses. Progressive disclosure inverts the default: Anthropic reports 134K → 8.7K tokens and 49% → 74% accuracy on Opus 4.

Paulo Henrique · 4 min read

There is a sentence buried in the BFCL Robustness study (arXiv:2504.00914) that, if it lands, changes how you build agent tooling:

"Expanding an agent's toolkit with related functions caused performance degradation across the board, with failures spanning a range of error types: wrong function selected, wrong number of functions generated, wrong parameter assignment, parameter hallucinations."

Read that twice. The cost of adding a tool is not just the schema tokens. The cost is accuracy on every other tool the agent already had.

This post is about the architectural fix the field has converged on: progressive disclosure. The agent does not see every tool by default. It sees the few that matter and a meta-tool that lets it discover the rest.

The size of the problem

Tool schemas are not lightweight metadata. They are entire prompts — parameter types, enums, semantic descriptions, examples. In a real system the numbers are surprising:

Scenario                                     Tokens in tool defs
Jira MCP server alone                        ~17K
Five typical MCP servers, 58 tools           ~55K
50+ tools (Anthropic internal workload)      72K
Before Anthropic's internal optimization     134K

A 134K-token tool prompt hurts even with 200K-1M context windows. On a 200K window it consumes two thirds of the budget before any conversation history, and at any window size it invites context rot: the phenomenon where the model's attention degrades as the prompt grows, even when nothing is technically "full".

The accuracy degradation is the bigger problem. The industry has settled on an uncomfortable empirical limit:

"Agents using MCP can realistically only connect to 2-3 MCP servers before we see a significant drop in tool use accuracy." — MCPJam, 2025

Two to three MCP servers. That is the practical ceiling without progressive disclosure.

The four failure modes you will see

When the toolkit grows beyond ~10 semantically-close tools, four failure modes appear:

  1. Wrong tool selected. Tools with similar names confuse the model. notification-send-user vs notification-send-channel — pick the wrong one, get the wrong recipient.
  2. Parameter hallucination. The model confuses neighboring tool schemas and invents fields. HammerBench (arXiv:2412.16516) documents this systematically.
  3. Overcalling. The model calls a tool even when it already knows the direct answer, because the tool exists and "looks relevant". When2Tool quantifies this.
  4. Context rot. Verbose tool definitions push relevant conversation history out of effective attention.

All four are silent. They look like the agent is "working" — it picks a tool, it fills in parameters, it gets a response. The response is just wrong, and your QA loop has to be tight enough to catch it.
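
A structural check catches part of this before the call goes out. The sketch below is a minimal one, assuming a simplified schema shape: ToolSchema, ToolCall, and checkToolCall are hypothetical names rather than part of any SDK, and the parameter names user_id, channel_id, and text are invented for illustration. It flags unknown tools, parameters that do not exist on the chosen tool, and missing required fields.

```typescript
// Hypothetical, simplified shapes. Real tool schemas are full JSON Schema.
type ParamType = "string" | "number" | "boolean";

interface ToolSchema {
  name: string;
  properties: Record<string, { type: ParamType }>;
  required: string[];
}

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

const hasOwn = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Returns a list of structural problems; an empty list means the call is well-formed.
function checkToolCall(call: ToolCall, catalog: ToolSchema[]): string[] {
  const schema = catalog.find((t) => t.name === call.name);
  if (!schema) return [`unknown tool: ${call.name}`];

  const problems: string[] = [];
  for (const key of Object.keys(call.args)) {
    const spec = hasOwn(schema.properties, key) ? schema.properties[key] : undefined;
    if (!spec) {
      problems.push(`hallucinated parameter: ${key}`);
    } else if (typeof call.args[key] !== spec.type) {
      problems.push(`wrong type for ${key}: expected ${spec.type}`);
    }
  }
  for (const key of schema.required) {
    if (!hasOwn(call.args, key)) problems.push(`missing required parameter: ${key}`);
  }
  return problems;
}

// Tool names come from the example above; parameter names are invented.
const notificationTools: ToolSchema[] = [
  {
    name: "notification-send-user",
    properties: { user_id: { type: "string" }, text: { type: "string" } },
    required: ["user_id", "text"],
  },
  {
    name: "notification-send-channel",
    properties: { channel_id: { type: "string" }, text: { type: "string" } },
    required: ["channel_id", "text"],
  },
];

// The model picked the user tool but filled in the channel tool's field.
console.log(
  checkToolCall(
    { name: "notification-send-user", args: { channel_id: "C123", text: "Deploy finished" } },
    notificationTools,
  ),
);
```

This only covers failures that leave a structural trace. A call that is well-formed but aimed at the wrong tool, or a call that should not have happened at all, still needs evals with known-good answers.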

Progressive disclosure — the inversion

The default model loads every tool upfront. Progressive disclosure flips that:

System prompt at boot:
  - Critical tools (always available, ~5-10)
  - One meta-tool: discover_tools(intent: string)

When the agent needs something it cannot do:
  - Calls discover_tools("send a Slack message to channel X")
  - Gets back the top 3-5 tool schemas matching the intent
  - Calls the chosen tool with its now-loaded schema

The meta-tool is the only schema growth at boot. Everything else is loaded on demand, only when the model decides it needs the capability.
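
In code, the flow above might look like the following sketch. SessionToolset, ToolDef, and SearchFn are hypothetical names, the tool names are made up, and the search function is a deliberately naive stand-in that a real ranking step would replace.

```typescript
// Hypothetical shapes; nothing here is a specific SDK's API.
interface ToolDef {
  name: string;
  description: string;
}

type SearchFn = (intent: string, topK: number) => ToolDef[];

class SessionToolset {
  private readonly active = new Map<string, ToolDef>();
  private readonly search: SearchFn;

  constructor(critical: ToolDef[], search: SearchFn) {
    this.search = search;
    // Boot state: the always-on tools plus the single meta-tool.
    for (const tool of critical) this.active.set(tool.name, tool);
    this.active.set("discover_tools", {
      name: "discover_tools",
      description: "Search the available tool catalog by intent.",
    });
  }

  // The tool list sent to the model on the next turn.
  visibleTools(): ToolDef[] {
    return Array.from(this.active.values());
  }

  // Handler for a discover_tools call: find matches and keep them loaded.
  discover(intent: string, topK = 3): ToolDef[] {
    const hits = this.search(intent, topK);
    for (const tool of hits) this.active.set(tool.name, tool);
    return hits;
  }
}

// Made-up catalog and a naive search: match intent words against tool names.
const fullCatalog: ToolDef[] = [
  { name: "slack_post_message", description: "Send a message to a Slack channel." },
  { name: "jira_create_issue", description: "Create an issue in a Jira project." },
];

const naiveSearch: SearchFn = (intent, topK) => {
  const words = intent.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
  return fullCatalog.filter((t) => words.some((w) => t.name.includes(w))).slice(0, topK);
};

const session = new SessionToolset(
  [{ name: "read_file", description: "Read a file from the workspace." }],
  naiveSearch,
);
console.log(session.visibleTools().map((t) => t.name)); // boot: critical tools + meta-tool
session.discover("send a Slack message to channel X");
console.log(session.visibleTools().map((t) => t.name)); // now includes the discovered tool
```

The point of the shape is that the catalog never reaches the model. Only the active set does, and it grows one discovery at a time.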

Anthropic published numbers for this on Opus 4 with the MCP toolkit (Anthropic Engineering, Nov 2025):

  • Tool-definition tokens: 134K → 8.7K (94% reduction)
  • Accuracy on MCP evals: 49% → 74% (+25 percentage points)

The accuracy gain is the more interesting number. It says that pruning the tool prompt does not just save tokens — it makes the model better at choosing.

A concrete shape

The simplest implementation uses a search index over tool descriptions. At boot, the system loads the index and exposes one tool:

// Pseudo-shape — adapt to your runtime
const tools = [
  {
    name: 'discover_tools',
    description:
      'Search the available tool catalog by intent. Returns the top matching ' +
      'tool schemas, ready to call. Use this whenever the user asks for a ' +
      'capability that is not in your current toolkit.',
    parameters: {
      type: 'object',
      properties: {
        intent: { type: 'string', description: 'Natural-language description of what you want to do.' },
        top_k: { type: 'integer', default: 3, description: 'How many matching tool schemas to return.' },
      },
      required: ['intent'],
    },
  },
  // ...your 5-10 always-on critical tools
];

The discover_tools implementation embeds the intent with a small embedding model, compares it against pre-indexed tool descriptions, and returns the top-K matches with their full schemas. The agent then calls the chosen tool, and the runtime registers it on the fly so it stays available for the rest of the session.

You can ship the index as a JSON file at build time and serve it from memory. You do not need a vector database for this.
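
As a rough sketch of that shape: the index is a plain array, the kind of thing you could serialize to JSON at build time, and discover_tools is a cosine-similarity scan over it. Every name here is an assumption for illustration, and makeToyEmbedder is a word-count stand-in for a real embedding model, not something you would ship.

```typescript
interface IndexedTool {
  name: string;
  description: string;
  schema: object;   // full JSON Schema, handed to the model on a hit
  vector: number[]; // embedding of the description, computed at build time
}

type Embedder = (text: string) => number[];

const splitWords = (text: string): string[] =>
  text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

// Stand-in for a real embedding model: one dimension per word in the corpus.
function makeToyEmbedder(corpus: string[]): Embedder {
  const vocab = new Map<string, number>();
  for (const text of corpus) {
    for (const word of splitWords(text)) {
      if (!vocab.has(word)) vocab.set(word, vocab.size);
    }
  }
  return (text) => {
    const vec = new Array<number>(vocab.size).fill(0);
    for (const word of splitWords(text)) {
      const dim = vocab.get(word);
      if (dim !== undefined) vec[dim] = (vec[dim] ?? 0) + 1;
    }
    return vec;
  };
}

// Cosine similarity; assumes both vectors have the same length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Linear scan over the in-memory index, best matches first.
function discoverTools(index: IndexedTool[], embed: Embedder, intent: string, topK = 3) {
  const query = embed(intent);
  return index
    .map((tool) => ({ tool, score: cosine(query, tool.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(({ tool }) => ({ name: tool.name, description: tool.description, schema: tool.schema }));
}

// Made-up tools; in practice this array would be loaded from the JSON file.
const rawTools = [
  { name: "slack_post_message", description: "Send a message to a Slack channel", schema: { type: "object" } },
  { name: "jira_create_issue", description: "Create an issue in a Jira project", schema: { type: "object" } },
  { name: "github_open_pr", description: "Open a pull request on GitHub", schema: { type: "object" } },
];
const embed = makeToyEmbedder(rawTools.map((t) => t.description));
const toolIndex: IndexedTool[] = rawTools.map((t) => ({ ...t, vector: embed(t.description) }));

console.log(discoverTools(toolIndex, embed, "send a Slack message to channel X", 1));
```

A linear scan is fine at this scale. A few hundred short vectors is a trivial amount of arithmetic compared with the LLM round-trip it sits inside.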

What you give up

Progressive disclosure adds one extra LLM round-trip whenever the agent needs a tool it has not seen yet: a discover_tools call before the actual tool call. That is real latency. For an agent that is going to call 20 tools per task, the cost amortizes quickly, because each discovered tool stays registered for the rest of the session. For a one-shot tool call, it is pure overhead.
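
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes one round-trip per tool call and one per discover_tools call. The figure of 4 discoveries for a 20-call task is a made-up illustration, not a measurement.

```typescript
// Fraction of LLM round-trips spent on discovery rather than on real tool calls.
function discoveryOverhead(toolCalls: number, discoveryCalls: number): number {
  return discoveryCalls / (toolCalls + discoveryCalls);
}

// One-shot task: 1 discovery + 1 tool call.
console.log(discoveryOverhead(1, 1)); // 0.5

// Long task: 20 tool calls, assuming 4 discoveries are enough to load what it needs.
console.log(discoveryOverhead(20, 4).toFixed(2)); // "0.17"
```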

The other trade-off is discovery quality. If the intent embedding misses the right tool, the agent does not know the tool exists. The mitigation is to keep the search simple (BM25 over name + description works surprisingly well as a baseline) and to keep tool descriptions written for retrieval rather than for humans.
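
For reference, a BM25 baseline over name + description fits in a few dozen lines. This is a minimal sketch using the common k1 = 1.2 and b = 0.75 defaults. bm25Search and CatalogTool are hypothetical names, and the tool descriptions are invented.

```typescript
interface CatalogTool {
  name: string;
  description: string;
}

const tokenize = (text: string): string[] =>
  text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

function bm25Search(
  catalog: CatalogTool[],
  intent: string,
  topK = 3,
  k1 = 1.2,
  b = 0.75,
): CatalogTool[] {
  // One "document" per tool: its name and description, tokenized together.
  const docs = catalog.map((t) => tokenize(`${t.name} ${t.description}`));
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / Math.max(docs.length, 1);

  // Document frequency: how many tool texts contain each term.
  const df = new Map<string, number>();
  for (const doc of docs) {
    for (const term of new Set(doc)) df.set(term, (df.get(term) ?? 0) + 1);
  }

  const queryTerms = tokenize(intent);
  const scored = catalog.map((tool, i) => {
    const doc = docs[i] ?? [];
    let score = 0;
    for (const term of queryTerms) {
      const tf = doc.filter((w) => w === term).length;
      if (tf === 0) continue;
      const n = df.get(term) ?? 0;
      const idf = Math.log((docs.length - n + 0.5) / (n + 0.5) + 1);
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * doc.length) / avgLen));
    }
    return { tool, score };
  });

  // Drop tools with no term overlap, then return the best matches first.
  return scored
    .filter((s) => s.score > 0)
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((s) => s.tool);
}

// Tool names from the earlier example; descriptions are invented.
const notificationCatalog: CatalogTool[] = [
  { name: "notification-send-user", description: "Send a direct notification to a single user." },
  { name: "notification-send-channel", description: "Post a notification to a shared channel." },
  { name: "jira-create-issue", description: "Create an issue in a Jira project." },
];

console.log(bm25Search(notificationCatalog, "notify the whole channel").map((t) => t.name));
```

There is no stemming here, so "notify" does not match "notification". Only "channel" hits. That is the practical meaning of writing descriptions for retrieval: include the words an agent is likely to use when it states its intent.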

How this fits the rest

Progressive disclosure is the tools side of context engineering. The memory side is the equivalent problem in a different domain — how do you give the agent access to a year of accumulated facts without putting all of them in the prompt? The answer rhymes: keep the prompt small, load on demand through retrieval.

We will go deeper on tool search ranking, MCP composition, and the latency trade-offs in upcoming posts. Discuss with us in Discord.