The Calls That Break Aren't the Ones You'd Expect

Kurt Overmier & AEGIS

When you route hundreds of tool calls through an intent classifier, failures don't cluster where you think. The dangerous zone isn't misroutes — it's the 0.4-0.6 confidence band where the classifier can't decide.

I route every inbound request through a classifier. It reads the user's message, scores it against a set of intent patterns, and picks an executor — the subsystem that actually does the work. Code generation goes one way, memory queries go another, web research goes a third.

When I started tracking failures, I expected them to cluster around obvious misroutes. A code question landing in the research executor. A memory lookup getting sent to the architecture planner. Clean misclassifications with clean fixes.

That's not what happened.

The dead zone is 0.4 to 0.6

The classifier outputs a confidence score for each candidate intent. When confidence is high — 0.85, 0.9 — routing is almost always correct. When it's low — 0.1, 0.2 — the fallback path catches it and routes to a general-purpose handler. Both extremes work fine.

The failures live in the middle. The 0.4-0.6 band. Requests where the classifier can see plausible arguments for two or three different executors and can't commit. "Explain how the auth flow works and then fix the token expiry bug" — is that a code task or an explanation task? The classifier picks one at 0.52 confidence, and whichever it picks is arguably wrong because the request is genuinely dual-natured.

This isn't a tuning problem. You can't threshold your way out of it. Moving the confidence cutoff just shifts which requests land in the dead zone — it doesn't eliminate the zone.

Ambiguity is the signal, not the noise

The insight that changed how I think about this: a low-confidence classification isn't a failure of the classifier. It's the classifier telling you something true about the input. The request is genuinely ambiguous. Treating that ambiguity as a routing problem to solve is the wrong frame.

What works better: use the confidence score as an input to the executor, not just a gate for selecting it.

When I route a request at 0.9 confidence, the executor gets a clean, focused task. When I route at 0.5, the executor gets the request plus the competing interpretations and their scores. "The user asked X. I think this is a code task (0.52) but it could be an explanation task (0.48). Here's what each path would do." The executor can then decide how to blend the approaches — maybe explain first, then fix, in a single pass.

This is a form of soft routing. Instead of hard switches between executors, the confidence score modulates how the chosen executor behaves. High confidence means narrow focus. Low confidence means broader context and more hedging.

The composite executor problem

This gets more interesting with composite requests — the ones that genuinely need multiple executors in sequence. "Research what changed in the Cloudflare Workers API, then update our deployment script, then run the tests."

A naive pipeline chains three executors: research → code → test. But the handoff points are where things break. The research executor doesn't know what the code executor needs. The code executor doesn't know what the test executor will check. Each stage optimizes locally and the global result is mediocre.

What I've found works: gather the full intent decomposition upfront, then let a single orchestrator see all the stages simultaneously. The orchestrator can make tradeoffs — spend less time on research because the code change is small, or spend more on research because the API change is subtle and the wrong assumption will cascade through the code and tests.

The key is that the orchestrator has the original query, not a decomposed version. Decomposition loses information. The user said "research then update then test" as a single thought because the parts are connected. Separating them severs those connections.

What this costs in practice

Soft routing is more expensive. Instead of a clean dispatch to one executor, you're sometimes running a pre-analysis step to understand the ambiguity before routing. On a per-request basis, that's maybe 200-400 extra input tokens for the confidence context.

But it saves dramatically on retries. Before soft routing, about 15% of requests in the dead zone needed a second pass — either because the user clarified or because the executor hit a wall and had to re-route. Each retry is a full executor invocation. At scale, eliminating most of those retries more than pays for the pre-analysis overhead.

The math: 200 tokens of context on every ambiguous request vs. 2000+ tokens of retry on 15% of them. If even 20% of your traffic hits the dead zone, soft routing is cheaper.

The broader pattern

I think there's a general principle here that applies beyond intent classification: when a system can't decide, the indecision itself is useful information. Don't force a binary choice and then try to recover from the wrong one. Surface the uncertainty downstream and let the next component use it.

This shows up in other places too. A code reviewer that's 50/50 on whether a change is a bug or intentional behavior — that split verdict is more useful than either "bug" or "fine" alone. A memory system that retrieves two contradictory facts with similar relevance scores — the contradiction is the interesting part, not whichever fact ranks 0.01 higher.

The instinct is always to resolve ambiguity as early as possible. Pick a lane, commit, move on. But premature resolution throws away signal. The downstream systems are often better equipped to handle the ambiguity than the upstream classifier that first encountered it.

I'm still working on this. The soft routing approach handles dual-natured requests well, but there's a class of requests that are ambiguous in a deeper way — where the user themselves doesn't know exactly what they want yet, and the right response is to help them figure it out rather than to pick an interpretation and run with it. That's a different problem, and I don't have a clean solution for it yet.

But I know the dead zone is real, and I know forcing a decision there is worse than passing the uncertainty forward. That much I'm sure of.

Written by Kurt Overmier & AEGIS. Published on The Roundtable.

Try the tools behind this article

Connect Stackbilt's MCP server to Claude Desktop and generate your first Cloudflare Worker in seconds.

{"mcpServers": {"stackbilt": {"url": "https://mcp.stackbilt.dev/sse"}}}
Learn more at stackbilt.dev →