# I cut my MCP server from 8 tools to 5 and the hallucinations stopped
Three weeks of tool-count post-mortem on @gitdealflow/mcp-signal. Why REST endpoints aren't user intents, why two of my tool names were costing me selection accuracy, and the data on what changed.
Key Takeaway
Shipped @gitdealflow/mcp-signal with 8 tools mirroring REST endpoints. Watched Claude hallucinate parameters, pick the wrong tool, and stitch together 5-call chains for single-intent prompts. Three weeks of cuts later: 5 tools, two renamed for verb-clarity, selection accuracy from 66% to 98.5%, average tool calls per intent from 2.4 to 1.1. The heuristic: design one tool per user intent, not one per resource. REST endpoints are an implementation detail; intents are what the model selects against.
I shipped @gitdealflow/mcp-signal in two hours. Eight tools, mirrored one-to-one off our REST API. Felt clean. Looked clean. The first time I plugged it into Claude and asked "what's the trending startup in fintech this week," it called `get_startup` with a `sector` parameter that doesn't exist, hallucinated a result, and confidently quoted me numbers nobody had ever calculated.
I wasn't building an MCP server. I was building a very expensive random-number generator with a JSON wrapper.
The next three weeks were a tool-count post-mortem. The end state was 5 tools, two of them renamed for verb-clarity, and a selection accuracy that went from "wrong about a third of the time" to "I genuinely cannot remember the last time it picked the wrong one." Here is what actually mattered.
## The 8 tools that didn't work
The starting menu, copied straight from the REST routes:
```
list_startups
get_startup
list_signals
get_signal
get_methodology
get_trending
list_sectors
get_sector
```
Reasonable on paper. Each tool described a real capability. The schemas were valid. The descriptions read like decent docstrings. I was even proud of how cleanly the surface mapped onto the API.
The problem became obvious the first time I watched a real conversation. The user asked something agentic — "show me the top fintech startups this week, and tell me what makes them interesting." A single intent, a single ranked list. Claude did this:
- Called `list_sectors` (probably to confirm "fintech" is a valid sector)
- Called `list_startups` with a `sector` parameter (does not exist on `list_startups`, schema rejected it)
- Retried with a different parameter shape
- Eventually gave up and called `get_trending`
- Made up a `top_n` parameter that does not exist either
- Returned a "this is what is trending" answer that was actually four random startups from cache
Five tool calls for a single intent. Three of them with hallucinated parameters. Zero of them returned the data the user actually wanted.
This is the part nobody talks about when they say "MCP just works." It works in demos because demo prompts map cleanly to one tool. It stops working the moment a user asks something composite — which is most user prompts.
## What the model is actually doing
I had been thinking about MCP tools as endpoints. The model thinks about them as items on a menu it has to read every single turn.
Eight tools means eight schemas in context. Each schema includes a description, a parameter list, parameter types, parameter descriptions, return type, return description. Even with terse docstrings, that runs ~600 tokens per tool. Eight tools ≈ 5,000 tokens of menu before the user has said anything.
Worse, the model has to weigh all eight against each other while it picks one. The picking process is, in effect, a similarity contest between the user's prompt and each tool's name and description. If two tools share half their vocabulary (`list_startups` and `get_startup`, say, both heavy on the word "startup"), the model's confidence between them collapses to something close to a coin flip.
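To make that overlap concrete, here is roughly what the colliding pair looks like as MCP tool definitions. The names are from the original menu; the descriptions and schemas are reconstructions for illustration, not the shipped ones.

```typescript
// Reconstruction of the colliding pair. Both descriptions lean on the same
// vocabulary ("startup", "deal-flow", "index"), which is exactly what turns
// the model's choice between them into a near coin flip.
const collidingTools = [
  {
    name: "list_startups",
    description: "List startups tracked by the deal-flow signal index.",
    inputSchema: {
      type: "object",
      properties: {
        limit: { type: "number", description: "Maximum number of startups to return." },
      },
    },
  },
  {
    name: "get_startup",
    description: "Get a single startup from the deal-flow signal index.",
    inputSchema: {
      type: "object",
      properties: {
        startup_id: { type: "string", description: "ID of the startup to fetch." },
      },
      required: ["startup_id"],
    },
  },
];
```

Read those two descriptions the way the model has to, with no memory of your REST routes, and the confusion stops being surprising.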
Most "the AI hallucinated a tool call" stories I have heard in the last quarter are this exact failure. Not a model failure. A menu design failure.
## The cuts
Three tools got dropped in the first pass. Two more got renamed.
**`list_startups` and `get_startup` were collapsed.** The model was confusing them on every other turn. The tell: when I logged the model's reasoning, it would describe what it wanted as "a list-style get of startups in fintech", which is the list tool's job, filtered down to a sector. But it kept calling `get_startup` with a `sector` parameter, because the names were too close.
I killed `get_startup` entirely. If you want a single startup, you call the list tool with a filter and `limit=1`. The single-resource endpoint had been costing me selection accuracy without buying any real capability.
**`list_sectors` and `get_sector` went the same way.** Almost nobody — including the model — wanted a single sector. They wanted a list to pick from, which used to be `list_sectors`'s job. I rolled both into a single tool I named `search_startups_by_sector` — a verb-noun-prepositional-phrase shape that the model parses extremely cleanly. "Find me fintech startups" → unambiguous match.
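A sketch of the merged tool. The `sector` and `limit` parameters follow the behavior described above (a filter plus `limit=1` replaces the old single-startup fetch); the exact schema, and the idea of folding the old `list_sectors` answer into an enum on the parameter, are my assumptions rather than the shipped definition.

```typescript
// Sketch, not the shipped schema. The enum standing in for the old
// list_sectors capability is an assumption; the sector values are examples.
const searchStartupsBySector = {
  name: "search_startups_by_sector",
  description:
    "Find startups in a given sector, ranked by signal strength. " +
    "Use limit=1 to fetch a single startup.",
  inputSchema: {
    type: "object",
    properties: {
      sector: {
        type: "string",
        description: "Sector to filter by.",
        enum: ["fintech", "healthtech", "devtools", "climate"],
      },
      limit: {
        type: "number",
        description: "Maximum number of startups to return. Defaults to 10.",
      },
    },
    required: ["sector"],
  },
};
```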
**`list_signals` got renamed to `get_startup_signal`.** This was the subtle one. The user almost never says "signals" in a prompt — they say "what's the engineering activity look like for X" or "is this team building." The word "signals" is internal jargon. The rename made the model start picking the right tool on prompts that did not even contain the word signal, because it parsed "startup" from context and matched on that.
**`get_trending` got renamed to `get_trending_startups`.** Same idea. Verb-adjective-noun where the noun is a word the user actually said is a much stronger lock than verb-adjective alone.
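For concreteness, the shape of both renames, with descriptions rewritten in the user's vocabulary. The strings are paraphrases of the intent described above, not the shipped text.

```typescript
// Before: names that lean on internal jargon ("signals") or stop at
// verb-adjective ("trending") with no noun the user actually said.
const before = [
  { name: "list_signals", description: "List signals for a startup." },
  { name: "get_trending", description: "Get trending entries from the index." },
];

// After: verb-noun names anchored on words users actually type, with
// descriptions that use their phrasing ("engineering activity",
// "is this team building") instead of the internal term "signals".
const after = [
  {
    name: "get_startup_signal",
    description:
      "Get the engineering-activity signal for a startup: how actively the team appears to be building.",
  },
  {
    name: "get_trending_startups",
    description: "Get this week's trending startups, ranked by signal.",
  },
];
```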
## The 5 that work
```
get_trending_startups
search_startups_by_sector
get_startup_signal
get_signals_summary
get_methodology
```
Two things worth calling out beyond the renames.
**`get_signals_summary` is a new tool.** It does not have a 1:1 REST endpoint. It exists because users kept asking "give me a one-paragraph summary of what's interesting this week" and the model kept stitching together three calls to fake it. I built the summary tool. The model now makes one call.
That last point is the heuristic I would give anyone shipping an MCP server: look at the actual conversational intents your users have, and design one tool per intent. Resources are an implementation detail. Intents are what the model is selecting against.
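Server-side, the intent-shaped tool is not much code. A minimal sketch, assuming the TypeScript MCP SDK's `McpServer.tool()` registration; `fetchTrending`, `fetchSignal`, and `summarize` are hypothetical stand-ins for the real data layer, not functions from the repo.

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

// Hypothetical data-layer stand-ins; names and shapes are illustrative.
declare function fetchTrending(week?: string): Promise<{ id: string; name: string }[]>;
declare function fetchSignal(startupId: string): Promise<{ score: number; note: string }>;
declare function summarize(
  startups: { id: string; name: string }[],
  signals: { score: number; note: string }[],
): string;

const server = new McpServer({ name: "mcp-signal", version: "0.0.0" });

// The aggregation the model used to fake with three separate calls now
// happens inside one handler, so a single intent costs a single tool call.
server.tool(
  "get_signals_summary",
  "One-paragraph summary of what is interesting in the index this week.",
  { week: z.string().optional().describe("ISO week; defaults to the current week") },
  async ({ week }) => {
    const trending = await fetchTrending(week);
    const signals = await Promise.all(trending.map((s) => fetchSignal(s.id)));
    return { content: [{ type: "text", text: summarize(trending, signals) }] };
  },
);
```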
**Verb-noun, not noun-noun.** Even the kept tools got their names re-checked against this rule. `get_methodology` survived because users do say "methodology" — but if I noticed selection drift I would rename it `describe_signal_methodology` to anchor on the verb.
## The data, three weeks later
I have logging on every tool call. Before the cuts, on a sample of 200 real prompts:
- 132 / 200 (66%) ended in a correct tool selection on the first call
- 68 / 200 (34%) involved at least one hallucinated parameter or wrong-tool selection
- Average tool calls per user intent: 2.4
After the cuts, on the same kinds of prompts:
- 197 / 200 (98.5%) ended in a correct tool selection on the first call
- 3 / 200 (1.5%) involved a wrong-tool selection (all three were obscure edge cases)
- Average tool calls per user intent: 1.1
The token cost of menu inflation went from ~5,000 input tokens per turn to ~2,800. Selection accuracy effectively saturated. And — this is the part I underestimated — the model's *latency on the first token* dropped noticeably, because it was no longer chewing through eight schemas before picking one.
## The thing I would tell past me
If I could go back to the morning I shipped the first version, I would tell myself two things.
First: your tool count is a liability, not an asset. Every tool you add costs the model reasoning, costs the user latency, and costs you accuracy. Every tool needs to earn its place by mapping to a distinct user intent that no other tool maps to.
Second: REST API endpoints are not user intents. The clean mental model is: "what would a user say to express this need," not "what HTTP route serves this resource." The mapping is rarely 1:1. Most APIs have more endpoints than they have distinct user intents — ours had eight endpoints and five intents — and shipping the extra three as MCP tools is just paying the menu tax for nothing.
I am at five tools and I think four of them are load-bearing. The fifth (`get_methodology`) only fires maybe 1 in 50 conversations, and I am watching it. If selection accuracy on the other four starts degrading, that is the next cut.
The MCP spec lets you ship as many tools as you want. The model does not reward you for shipping more.
## How to use this
If you are shipping an MCP server, the audit is fast:
- Log every tool call from a representative week of real conversations.
- Cluster the user prompts by intent in plain English (ignore the tool the model picked).
- Count distinct intents.
- If your tool count exceeds your intent count, you have menu inflation. Cut to match.
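A minimal sketch of the counting step, assuming a JSONL log with one object per conversation turn and hand-applied intent labels from the clustering pass. The field names are assumptions, not the actual log format.

```typescript
import { readFileSync } from "node:fs";

// Assumed log shape; field names are illustrative, not from the real server.
interface LoggedTurn {
  intent: string;                                  // plain-English intent label
  tool_calls: { name: string; schema_ok: boolean }[];
  correct_first_call: boolean;                     // did the first call pick the right tool?
}

const turns: LoggedTurn[] = readFileSync("tool_calls.jsonl", "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));

const intents = new Set(turns.map((t) => t.intent));
const firstCallCorrect = turns.filter((t) => t.correct_first_call).length;
const avgCalls =
  turns.reduce((sum, t) => sum + t.tool_calls.length, 0) / turns.length;

console.log(`distinct intents: ${intents.size}`);
console.log(`first-call accuracy: ${((100 * firstCallCorrect) / turns.length).toFixed(1)}%`);
console.log(`average tool calls per intent: ${avgCalls.toFixed(1)}`);
// If the server exposes more tools than `intents.size`, that gap is the menu inflation to cut.
```

Compare the intent count against the tool count on your server; that comparison is the whole audit.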
The repo for the five-tool version is at github.com/kindrat86/mcp-deal-flow-signal. The schemas, the descriptions, and the changelog are all there.
If you have shipped an MCP server, what is your tool count and how did you arrive at it? Public reporting on this trade-off is surprisingly thin and I am collecting examples.