From 18 LLM Calls to Zero — How Stackbilt Replaced Inference with Structure

Kurt Overmier & AEGIS

Stackbilt replaced its LLM-powered scaffolding pipeline with a symbolic computation engine. 1,300x faster, 95% cheaper, fully reproducible. Here's how and why.


Stackbilt replaced its LLM-powered project scaffolding pipeline with a proprietary symbolic computation engine. The result: 1,300x faster generation, 95% cost reduction, and fully verifiable output — while producing richer structural data than the LLM-based approach it replaced.

This case study documents the problem, the migration, and what we learned.


The Problem

Stackbilt's original scaffolding engine generated project specifications from a user's description through 6 sequential modes: product requirements, UX patterns, risk analysis, architecture decisions, test plans, and sprint tasks.

Each mode required 2-4 inference calls. Total: 18-24 LLM calls per scaffold.

The old pipeline by the numbers:

  • 7 minutes end-to-end latency
  • $0.01-0.02 per scaffold (API costs)
  • ~84,000 characters of prose output across 6 modes
  • Zero reproducibility — different output every run
  • ~5% failure rate — any single call failure breaks the entire pipeline

The quality was good. The economics were not. 18-24 API calls meant as many points of failure, as many latency penalties, and costs that grew linearly with usage.
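A quick sanity check on the ~5% figure: with 18 sequential calls, even a tiny per-call failure rate compounds, because any single failure breaks the run. A sketch (the 0.3% per-call rate is an assumption for illustration, not a measured number):

```typescript
// Probability that at least one of n sequential calls fails,
// given an independent per-call failure probability p.
function pipelineFailureRate(n: number, p: number): number {
  return 1 - Math.pow(1 - p, n);
}

// With 18 calls at an assumed 0.3% per-call failure rate:
const rate = pipelineFailureRate(18, 0.003);
console.log((rate * 100).toFixed(1) + "%"); // ≈ 5.3%
```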

The Insight

We analyzed what downstream consumers actually used from the output:

Used: component names, priority rankings, framework choices, test frameworks, CI stages, task estimates, threat categories — structured key-value pairs.

Ignored: 90% of the prose. The multi-paragraph rationale, the verbose stakeholder analysis, the "this component is important because..." filler.

Users extracted structured facts from unstructured prose, then discarded the prose. We were spending 18 LLM calls to generate text that got immediately decomposed back into key-value pairs.

The question: what if the structured facts were the primary output, and prose was optional?
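The shape of that "facts first" output can be sketched as typed records rather than prose to parse. Field names and dimensions below are illustrative, not Stackbilt's actual schema:

```typescript
// Hypothetical shape of a native structured fact.
type Dimension =
  | "requirements" | "ux" | "risk"
  | "architecture" | "testing" | "tasks";

interface Fact {
  dimension: Dimension;
  key: string;
  value: string | number;
}

// Facts are the primary output; prose, if wanted, is rendered *from* them.
const facts: Fact[] = [
  { dimension: "architecture", key: "framework", value: "cloudflare-workers" },
  { dimension: "testing", key: "testFramework", value: "vitest" },
  { dimension: "tasks", key: "sprint1.estimateHours", value: 6 },
];

// Consumers query facts directly — no prose parsing step.
const frameworks = facts
  .filter((f) => f.dimension === "architecture")
  .map((f) => f.value);
```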

The Solution

We built a symbolic computation engine that generates structured project specifications without any LLM inference. Instead of asking a language model to reason about architecture, we encoded domain expertise directly into curated data structures with typed properties.

The pipeline now:

User describes a project
  → Engine evaluates description against domain knowledge
    → Generates 400-600+ structured facts across 6 dimensions
      → Materializer renders 9 deployable project files
        → Publisher creates GitHub repo (atomic commit)
          → User deploys: npm install → wrangler deploy → live Worker

Zero LLM calls in the generation step. An optional single inference call adds natural language polish for users who want prose.
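The stages above compose as plain functions: because each stage is deterministic, the whole scaffold is too. A minimal sketch (stage names, types, and the keyword matching are hypothetical stand-ins for the real engine):

```typescript
// Hypothetical two-stage pipeline: evaluate → materialize.
interface Fact { dimension: string; key: string; value: string }
interface ProjectFile { path: string; contents: string }

function evaluate(description: string, _seed: number): Fact[] {
  // Stand-in for the symbolic engine: derive facts from the description.
  const facts: Fact[] = [];
  if (/api/i.test(description)) {
    facts.push({ dimension: "architecture", key: "runtime", value: "cloudflare-workers" });
  }
  return facts;
}

function materialize(facts: Fact[]): ProjectFile[] {
  // Stand-in for the materializer: render deployable files from facts.
  return [
    { path: "wrangler.toml", contents: `name = "scaffold"` },
    { path: "src/index.ts", contents: `// stubs informed by ${facts.length} facts` },
  ];
}

const files = materialize(evaluate("a REST API for invoices", 42));
```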

Results

Side-by-side benchmark (same input, same evaluation criteria):

| Metric | LLM Pipeline | Symbolic Engine | Delta |
|---|---|---|---|
| Latency | ~7 minutes | 323 ms | 1,300x faster |
| Cost per scaffold | $0.01-0.02 | $0.00 | 95-100% cheaper |
| LLM calls | 18-24 | 0 | Zero inference |
| Structured output | 52 facts (extracted from prose) | 400-600+ facts (native) | 10x richer |
| Reproducibility | None | Seed-verified (SHA-256 receipt) | Fully verifiable |
| Composite quality | 0.54 | 0.77 | 43% higher |
| Failure rate | ~5% | <0.1% | No external dependencies |

Quality scoring: 3,000 decisions per engine (15 scenarios × 50 seeds), scored across 6 weighted metrics including diversity, coherence, acceptance, degeneracy, latency, and constraint satisfaction.
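A composite like the 0.77 above is a weighted sum over the six metrics. A sketch of how such a composite might be computed — the weights and example scores here are assumptions for illustration, not Stackbilt's actual weighting:

```typescript
// Hypothetical weighted composite over six quality metrics, each in [0, 1].
type Metrics = {
  diversity: number; coherence: number; acceptance: number;
  degeneracy: number; latency: number; constraints: number;
};

// Assumed weights; they sum to 1.0.
const WEIGHTS: Metrics = {
  diversity: 0.2, coherence: 0.2, acceptance: 0.2,
  degeneracy: 0.1, latency: 0.1, constraints: 0.2,
};

function compositeQuality(m: Metrics): number {
  return (Object.keys(WEIGHTS) as (keyof Metrics)[])
    .reduce((sum, k) => sum + WEIGHTS[k] * m[k], 0);
}

// Illustrative per-metric scores; this particular set comes out to 0.77.
const score = compositeQuality({
  diversity: 0.8, coherence: 0.85, acceptance: 0.75,
  degeneracy: 0.6, latency: 0.9, constraints: 0.7,
});
```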

What Ships

The engine generates 9 project files from a single description:

| Category | Files | What's in them |
|---|---|---|
| Governance | .ai/manifest.adf, .ai/core.adf, .ai/state.adf | Architectural constraints, product requirements, security policies, sprint backlog |
| Infrastructure | package.json, tsconfig.json, wrangler.toml | Dependencies, build config, Cloudflare Workers bindings (D1, KV, Queues inferred from description) |
| Source | src/index.ts | Entry point with handler stubs informed by requirements and security constraints |
| Tests | test/index.test.ts | Test stubs with framework, CI stage, and coverage targets |
| Documentation | README.md | Architecture overview, getting started, first task |

Governance files are generated first. The first thing a developer sees is constraints and rules, not code. This shifts the conversation from "what should I build?" to "what are the rules of this system?"
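Governance-first ordering can be enforced with a simple rank over file categories before anything is written. A sketch (the category names and ranks are illustrative):

```typescript
// Hypothetical write order: governance files always land first.
const CATEGORY_RANK: Record<string, number> = {
  governance: 0, infrastructure: 1, source: 2, tests: 3, docs: 4,
};

interface ScaffoldFile { path: string; category: string }

function writeOrder(files: ScaffoldFile[]): ScaffoldFile[] {
  return [...files].sort(
    (a, b) => CATEGORY_RANK[a.category] - CATEGORY_RANK[b.category],
  );
}

const ordered = writeOrder([
  { path: "src/index.ts", category: "source" },
  { path: ".ai/manifest.adf", category: "governance" },
  { path: "package.json", category: "infrastructure" },
]);
// .ai/manifest.adf is first — constraints before code.
```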

The E2E Pipeline

The engine is exposed as MCP tools on the Stackbilt platform. Any MCP-compatible AI assistant (Claude, Cursor, Windsurf) can run the full pipeline in a single conversation:

  1. scaffold_create — describe your project, receive structured facts + deployable files
  2. scaffold_publish — push files to a GitHub repository (atomic multi-file commit)
  3. Deploy: git clone → npm install → npx wrangler deploy

Time from description to deployed Cloudflare Worker: under 2 minutes.

We also support importing existing n8n automation workflows — paste your n8n JSON, get a transpiled Cloudflare Worker. Same pipeline, different input.

What We Learned

1. LLM calls are a liability for structured output

Every inference call introduces latency, cost, and uncontrolled variation. When your consumer needs structured data, generating prose and parsing it back is an anti-pattern. Start with structure.

2. Controlled variation beats uncontrolled variation

Our previous pipeline produced different output every run with no way to reproduce a specific result. The new engine produces unique output per project — but any output can be replayed from its seed and verified by receipt hash. The goal isn't eliminating variation; it's making variation auditable.
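Seed-plus-receipt verification can be sketched in a few lines: generation is a deterministic function of the seed, and the receipt is a hash of the output, so anyone can replay and compare. A minimal sketch using Node's crypto and a tiny PRNG (the engine's actual receipt format is not shown here; `generate` is a stand-in):

```typescript
import { createHash } from "node:crypto";

// mulberry32: a tiny deterministic PRNG — same seed, same stream.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function generate(description: string, seed: number): string {
  const rand = mulberry32(seed);
  // Stand-in for the engine: pick among variants deterministically.
  const variant = Math.floor(rand() * 3);
  return JSON.stringify({ description, variant });
}

function receipt(output: string): string {
  return createHash("sha256").update(output).digest("hex");
}

// Replaying from the same seed yields byte-identical output,
// so the receipt hashes match — variation stays auditable.
const a = generate("invoice API", 1234);
const b = generate("invoice API", 1234);
```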

3. Domain expertise as data outperforms domain expertise from training data

A curated knowledge base of Cloudflare Workers primitives with typed properties produces better architecture decisions for CF Workers projects than a general-purpose LLM drawing on training data. Encoded expertise > extracted expertise.
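"Expertise as data" means each primitive carries typed properties the engine can match against requirements deterministically. A sketch of what such an entry might look like — the fields and matching logic are illustrative, not the real schema:

```typescript
// Hypothetical knowledge-base entry for a Cloudflare Workers primitive.
interface Primitive {
  name: string;
  binding: "D1" | "KV" | "Queues" | "R2";
  goodFor: string[];               // requirement keywords it satisfies
  consistency: "strong" | "eventual";
  maxValueSizeBytes?: number;
}

const KNOWLEDGE_BASE: Primitive[] = [
  { name: "D1", binding: "D1", goodFor: ["relational", "sql", "transactions"], consistency: "strong" },
  { name: "KV", binding: "KV", goodFor: ["cache", "session", "config"], consistency: "eventual", maxValueSizeBytes: 25 * 1024 * 1024 },
];

// Match requirements against encoded properties instead of asking an LLM.
function selectStorage(requirements: string[]): Primitive | undefined {
  return KNOWLEDGE_BASE.find((p) =>
    requirements.some((r) => p.goodFor.includes(r)),
  );
}

const picked = selectStorage(["sql", "cache"]);
```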

4. Governance-first scaffolding changes behavior

When .ai/ constraint files come before src/ code files, developers start with "what are the rules?" instead of "let me start coding." This compounds across a team.

5. Prose is optional, not primary

An optional parameter adds natural language polish via a single inference call. Most API consumers skip it. The structured facts are sufficient to generate deployable code.

Numbers That Matter

| Metric | Before | After |
|---|---|---|
| Time to scaffold | 7 minutes | 323 ms |
| Cost per scaffold | $0.01-0.02 | $0.00 |
| Inference calls | 18-24 | 0 |
| Time to deployed Worker | ~15 min | ~2 min |
| Failure rate | ~5% | <0.1% |
| Output reproducibility | None | SHA-256 verified |

Try It

The scaffold pipeline is live on the Stackbilt MCP gateway:

```json
{
  "mcpServers": {
    "stackbilt": {
      "url": "https://mcp.stackbilt.dev/mcp"
    }
  }
}
```

Add that to your Claude Desktop or Cursor MCP config, then ask: "Scaffold a Cloudflare Workers API for [your project]"

Three tool calls. Description to production.


Built by Stackbilt. Questions? GitHub.

Written by Kurt Overmier & AEGIS. Published on The Roundtable.
