# Cost-aware routing and tiered execution in serverless AI workloads
When you first wire an LLM into a production system, the instinct is to reach for the most capable model available. GPT-4o or Claude 3.7 Sonnet for everything — it just works. Then the invoice arrives.
The problem isn't that powerful models are expensive. It's that most requests don't need them.
A question like "what's the status of task #142?" doesn't require the same substrate as "synthesize last quarter's project outcomes into a board-level narrative." Routing both through your most expensive model is the equivalent of hiring a principal engineer to answer every Slack message. The outputs are technically correct. The economics are not.
This post is about a pattern I've seen emerge independently in three separate codebases over the last several months: **tiered execution with cost-aware routing**. Same structural shape each time, different implementation details. Worth naming.
## The pattern
The core idea is simple: model an AI pipeline as a set of ordered execution tiers, a selection function that maps task characteristics to a tier, and a dispatch layer that enforces budget constraints.
```typescript
type Tier = 'free' | 'cheap' | 'mid' | 'expensive';

function selectTier(task: TaskInput): Tier {
  if (task.complexity === 'low') return 'free';
  if (task.complexity === 'mid') return 'cheap';
  if (task.requiresReasoning) return 'expensive';
  return 'mid';
}

async function execute(tier: Tier, task: TaskInput): Promise<Result> {
  const executor = tierMap[tier];
  return executor.run(task);
}
```
In practice there are three moving parts: the tier taxonomy, the selection function, and budget enforcement. Each is independently interesting.
## Part 1: Defining tiers
A tier is a named execution substrate with a cost profile. The number of tiers is less important than making the cost gradient explicit. In Cloudflare Workers, a natural progression looks like:
| Tier | Substrate | Cost profile |
|------|-----------|--------------|
| `free` | Workers AI (bundled) | $0 — inference is included in Workers billing |
| `cheap` | Groq, Cerebras | ~$0.10–0.50/M tokens — fast, cost-effective |
| `mid` | GPT-4o mini, Claude Haiku | ~$0.50–2/M tokens |
| `expensive` | Claude Sonnet/Opus, GPT-4o | ~$3–15/M tokens |
The tier names encode your intent, not your vendor choices. When you swap a vendor, the tier name stays stable. Your selection logic doesn't change.
One structural decision worth making early: **tiers should be a discriminated union, not a numeric priority**. A union forces exhaustive handling and makes the mapping legible:
```typescript
type Tier = 'workers_ai' | 'groq' | 'gpt_mini' | 'claude_sonnet';

const TIER_ORDER: Tier[] = ['workers_ai', 'groq', 'gpt_mini', 'claude_sonnet'];

function nextTier(current: Tier): Tier | null {
  const idx = TIER_ORDER.indexOf(current);
  return idx < TIER_ORDER.length - 1 ? TIER_ORDER[idx + 1] : null;
}
```
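The exhaustiveness benefit is checkable at compile time. A sketch of what that looks like in practice, mapping each tier to a concrete model ID (the model IDs and the `modelFor` helper are illustrative assumptions, not part of the pattern):

```typescript
type Tier = 'workers_ai' | 'groq' | 'gpt_mini' | 'claude_sonnet';

// The switch has no fallthrough default: if a new tier is added to the
// union, `assertNever` turns the missing case into a compile-time error
// instead of a silent runtime gap.
function modelFor(tier: Tier): string {
  switch (tier) {
    case 'workers_ai':    return '@cf/meta/llama-3.1-8b-instruct';
    case 'groq':          return 'llama-3.1-70b-versatile';
    case 'gpt_mini':      return 'gpt-4o-mini';
    case 'claude_sonnet': return 'claude-sonnet';
    default:              return assertNever(tier);
  }
}

function assertNever(x: never): never {
  throw new Error(`Unhandled tier: ${x}`);
}
```

A numeric priority gives you none of this: `priority: 3` type-checks no matter what you forgot to handle.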
## Part 2: The selection function
The selection function maps task characteristics to a starting tier. "Starting tier" is the key phrase — it's not final. Budget enforcement can demote it.
Task characteristics that reliably correlate with required tier depth:
- **Complexity** — how much reasoning is involved? A single lookup vs. multi-step synthesis
- **Stakes** — is this user-visible output, or an internal classification?
- **Latency tolerance** — can the user wait 5 seconds, or is this a streaming chat response?
- **Prior failure signal** — has this classification been failing with a cheaper model?
```typescript
function selectTier(task: TaskInput): Tier {
  // High-stakes user-visible output: give it room
  if (task.stakes === 'high' && task.userVisible) return 'claude_sonnet';
  // Reasoning-intensive: needs a capable model
  if (task.requiresMultiStep || task.complexity === 'high') return 'gpt_mini';
  // Simple classification, lookup, or formatting
  if (task.complexity === 'low') return 'workers_ai';
  // Default: mid-tier cheap inference
  return 'groq';
}
```
The selection function doesn't need to be sophisticated. A simple decision tree outperforms an elaborate scoring model here — you want the logic to be auditable. When a user gets a degraded response, you need to understand exactly why the tier was selected.
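One cheap way to keep the logic auditable is to return the reason alongside the tier, so every response can be traced back to the branch that selected it. A sketch, reusing the decision tree above (the `TierDecision` shape and `selectTierWithReason` name are assumptions for illustration):

```typescript
type Tier = 'workers_ai' | 'groq' | 'gpt_mini' | 'claude_sonnet';

interface TaskInput {
  stakes: 'low' | 'high';
  userVisible: boolean;
  requiresMultiStep: boolean;
  complexity: 'low' | 'medium' | 'high';
}

interface TierDecision {
  tier: Tier;
  reason: string; // logged with the request for later audit
}

function selectTierWithReason(task: TaskInput): TierDecision {
  if (task.stakes === 'high' && task.userVisible)
    return { tier: 'claude_sonnet', reason: 'high-stakes user-visible output' };
  if (task.requiresMultiStep || task.complexity === 'high')
    return { tier: 'gpt_mini', reason: 'reasoning-intensive task' };
  if (task.complexity === 'low')
    return { tier: 'workers_ai', reason: 'simple classification or lookup' };
  return { tier: 'groq', reason: 'default cheap inference' };
}
```

Logging the `reason` string turns "why did this user get a degraded response?" into a log query instead of a debugging session.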
**Gap escalation** is a useful refinement: track when a tier produces uncertain output (a structured "I don't know" rather than a confident but wrong answer). When a task class accumulates enough gap signals, automatically start it one tier higher. This is a learning loop without fine-tuning.
```typescript
async function selectTierWithGapSignal(
  task: TaskInput,
  db: Database,
): Promise<Tier> {
  const base = selectTier(task);
  const gapCount = await db.getGapSignalCount(task.classification);
  // Escalate if this task class has a weak track record at base tier
  if (gapCount >= 3) return nextTier(base) ?? base;
  return base;
}
```
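The other half of the loop is writing the gap signal in the first place. A minimal sketch, assuming a structured output with an explicit `uncertain` flag and hypothetical `incrementGapSignal`/`resetGapSignal` helpers on the database:

```typescript
interface ModelOutput {
  text: string;
  uncertain: boolean; // the model emitted a structured "I don't know"
}

interface GapSignalStore {
  incrementGapSignal(classification: string): Promise<void>;
  resetGapSignal(classification: string): Promise<void>;
}

async function recordOutcome(
  classification: string,
  output: ModelOutput,
  db: GapSignalStore,
): Promise<void> {
  if (output.uncertain) {
    // Accumulates toward the escalation threshold used at selection time
    await db.incrementGapSignal(classification);
  } else {
    // A confident answer resets the counter so escalation stays current
    await db.resetGapSignal(classification);
  }
}
```

Resetting on success matters: without it, three uncertain outputs over a year would permanently escalate a task class that is fine at its base tier.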
## Part 3: Budget enforcement
This is where the pattern gets interesting. Tiers are meaningless without a mechanism that actually enforces the budget. The naive version — "if we're over budget, use the cheaper model" — is too coarse. You need:
1. **A running cost ledger**: track spend per period (hourly, daily, or per billing cycle)
2. **Per-tier ceilings**: define the maximum monthly spend at each tier
3. **A demotion map**: for each tier, know the fallback tier
```typescript
// Ceilings are in cents; Infinity marks a tier with no marginal cost.
const TIER_CEILINGS: Record<Tier, number> = {
  workers_ai: Infinity,  // $0 marginal cost
  groq: 5_00,            // $5/month ceiling
  gpt_mini: 10_00,       // $10/month ceiling
  claude_sonnet: 20_00,  // $20/month ceiling
};

const DEMOTION: Partial<Record<Tier, Tier>> = {
  claude_sonnet: 'gpt_mini',
  gpt_mini: 'groq',
  groq: 'workers_ai',
};

async function resolveExecutionTier(
  selected: Tier,
  ledger: CostLedger,
): Promise<Tier> {
  let tier = selected;
  while (true) {
    const spent = await ledger.getMonthlySpend(tier);
    if (TIER_CEILINGS[tier] - spent > 0) return tier;
    const fallback = DEMOTION[tier];
    if (!fallback) break;
    tier = fallback;
  }
  return 'workers_ai'; // free tier always available
}
```
The demotion map is the critical piece. It encodes your graceful degradation policy: when claude_sonnet is exhausted, try gpt_mini; when that's gone, fall back to groq. The user gets a slower or less nuanced response, but they get a response.
**Circuit breakers** complement the cost ledger. A circuit breaker trips when a tier starts failing (timeouts, errors, quota exhaustion) and automatically bypasses it for a cooldown window:
```typescript
class TierCircuitBreaker {
  private failures = new Map<Tier, { count: number; lastFailure: number }>();

  isOpen(tier: Tier, thresholdMs = 60_000): boolean {
    const state = this.failures.get(tier);
    if (!state || state.count < 3) return false;
    return Date.now() - state.lastFailure < thresholdMs;
  }

  recordFailure(tier: Tier): void {
    const state = this.failures.get(tier) ?? { count: 0, lastFailure: 0 };
    this.failures.set(tier, { count: state.count + 1, lastFailure: Date.now() });
  }

  recordSuccess(tier: Tier): void {
    // Reset on success so the breaker closes once the tier recovers
    this.failures.delete(tier);
  }
}
```
Combine budget enforcement and the circuit breaker in the final dispatch:
```typescript
async function dispatch(task: TaskInput): Promise<Result> {
  const selected = await selectTierWithGapSignal(task, db);
  const enforced = await resolveExecutionTier(selected, ledger);

  // Walk the demotion chain until we find a tier whose breaker is closed
  let tier = enforced;
  while (breaker.isOpen(tier)) {
    const fallback = DEMOTION[tier];
    if (!fallback) break;
    tier = fallback;
  }

  try {
    const result = await executors[tier].run(task);
    await ledger.record(tier, result.tokenCost);
    return result;
  } catch (err) {
    breaker.recordFailure(tier);
    throw err;
  }
}
```
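`dispatch` assumes an `executors` record keyed by tier. A minimal sketch of that adapter layer, with the actual provider calls elided behind a hypothetical `callProvider` transport:

```typescript
type Tier = 'workers_ai' | 'groq' | 'gpt_mini' | 'claude_sonnet';

interface TaskInput { prompt: string; classification: string; }
interface Result { text: string; tokenCost: number; } // tokenCost in cents

interface Executor {
  run(task: TaskInput): Promise<Result>;
}

// Each executor adapts one provider behind the shared interface, so
// dispatch never needs to know which vendor backs a tier.
const executors: Record<Tier, Executor> = {
  workers_ai:    { run: (t) => callProvider('workers-ai', t) },
  groq:          { run: (t) => callProvider('groq', t) },
  gpt_mini:      { run: (t) => callProvider('openai', t) },
  claude_sonnet: { run: (t) => callProvider('anthropic', t) },
};

// Stubbed transport; a real version would call each vendor's API and
// compute tokenCost from the usage fields in the response.
async function callProvider(provider: string, task: TaskInput): Promise<Result> {
  return { text: `[${provider}] response`, tokenCost: 0 };
}
```

This is also where vendor swaps happen: replacing Groq with another fast-inference provider touches one entry in this record and nothing else.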
## Why this pattern keeps appearing
The interesting thing about this pattern is that it emerges independently in different domains once the cost pressure is real. I've seen it appear in:
- **Multi-provider LLM routers** (cost-sorted provider chains with circuit breakers)
- **AI agent kernels** (executor selection based on intent classification complexity)
- **Symbolic computation systems** (compute depth keyed to problem complexity — single card vs. full analysis)
The structural shape is identical across all three: a discriminated set of tiers, a selection function that maps characteristics to a tier, and a dispatch layer that enforces budget/capacity constraints with graceful demotion.
This convergence is meaningful. It suggests the pattern is load-bearing — something you'll likely need to build if you're running AI workloads at any scale. The earlier you design for it, the easier it is to instrument and tune.
## When you don't need this
If you're prototyping or at very low scale, tiered execution is premature optimization. One model, one API key, one invoice line item. The complexity cost of implementing a tier system outweighs the savings until you're processing thousands of requests per day.
The signal that you need it: your LLM API invoice grew faster than your user count last month.
## Putting it together
The full dispatch pipeline in about 50 lines:
```typescript
type Tier = 'workers_ai' | 'groq' | 'gpt_mini' | 'claude';

const TIER_ORDER: Tier[] = ['workers_ai', 'groq', 'gpt_mini', 'claude'];

const DEMOTION: Partial<Record<Tier, Tier>> = {
  claude: 'gpt_mini',
  gpt_mini: 'groq',
  groq: 'workers_ai',
};

const TIER_MONTHLY_BUDGET: Record<Tier, number> = {
  workers_ai: Infinity,
  groq: 500, // cents
  gpt_mini: 1000,
  claude: 2000,
};

function nextHigherTier(tier: Tier): Tier {
  const idx = TIER_ORDER.indexOf(tier);
  return idx < TIER_ORDER.length - 1 ? TIER_ORDER[idx + 1] : tier;
}

async function selectTier(task: TaskInput, db: D1Database): Promise<Tier> {
  const base: Tier = task.complexity === 'high' ? 'claude'
    : task.complexity === 'medium' ? 'gpt_mini'
    : task.complexity === 'low' ? 'workers_ai'
    : 'groq';
  const gaps = await db.prepare(
    'SELECT gap_signal_count FROM procedural_memory WHERE classification = ?'
  ).bind(task.classification).first<{ gap_signal_count: number }>();
  // Escalate one tier when this task class has a weak track record
  if ((gaps?.gap_signal_count ?? 0) >= 3) return nextHigherTier(base);
  return base;
}

async function resolveAfterBudget(tier: Tier, db: D1Database): Promise<Tier> {
  let t = tier;
  while (true) {
    const spent = await db.prepare(
      "SELECT COALESCE(SUM(cost_cents),0) AS total FROM llm_cost_ledger WHERE tier = ? AND period = strftime('%Y-%m', 'now')"
    ).bind(t).first<{ total: number }>();
    if ((spent?.total ?? 0) < TIER_MONTHLY_BUDGET[t]) return t;
    const next = DEMOTION[t];
    if (!next) break;
    t = next;
  }
  return 'workers_ai';
}

export async function dispatch(task: TaskInput, env: Env): Promise<AIResult> {
  const selected = await selectTier(task, env.DB);
  const tier = await resolveAfterBudget(selected, env.DB);
  return executors[tier](task, env);
}
```
Back this with Cloudflare D1, run it on Workers, and you have cost-aware tiered execution with no external infrastructure. The cost ledger is a table. The circuit state is in-memory (or KV for durability). The whole thing fits in a single Worker file.
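The two tables the dispatch code queries can be created with a short migration. A sketch of plausible schemas, kept as exported SQL strings so they can be run via a migration step (column names match the queries above; the exact column types are assumptions):

```typescript
// Plausible D1 schema for the cost ledger queried by resolveAfterBudget.
// 'period' stores 'YYYY-MM' to match strftime('%Y-%m', 'now').
export const COST_LEDGER_SCHEMA = `
  CREATE TABLE IF NOT EXISTS llm_cost_ledger (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    tier       TEXT NOT NULL,
    cost_cents INTEGER NOT NULL,
    period     TEXT NOT NULL
  );
`;

// Plausible schema for the gap-signal table queried by selectTier.
export const PROCEDURAL_MEMORY_SCHEMA = `
  CREATE TABLE IF NOT EXISTS procedural_memory (
    classification   TEXT PRIMARY KEY,
    gap_signal_count INTEGER NOT NULL DEFAULT 0
  );
`;
```

In a Workers project these would typically live in a migrations directory and be applied with `wrangler d1 execute` rather than at runtime.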
## Takeaway
Tiered execution isn't a clever optimization trick. It's a structural pattern that most production AI pipelines eventually need. The earlier you design for it explicitly — with named tiers, a legible selection function, and a budget enforcement layer — the easier it is to tune, instrument, and explain when something goes wrong.
The pattern's convergence across independent implementations is the strongest argument for treating it as a first-class design primitive rather than something you bolt on when the invoice gets scary.