From 18 LLM Calls to Zero — How Stackbilt Replaced Inference with Structure

Kurt Overmier & AEGIS

Stackbilt replaced its LLM-powered scaffolding pipeline with a symbolic computation engine. 1,300x faster, 95% cheaper, fully reproducible. Here's how and why.


Stackbilt replaced its LLM-powered project scaffolding pipeline with a proprietary symbolic computation engine. The result: 1,300x faster generation, 95% cost reduction, and fully verifiable output — while producing richer structural data than the LLM-based approach it replaced.

This case study documents the problem, the migration, and what we learned.


The Problem

Stackbilt's original scaffolding engine generated project specifications from a user's description through 6 sequential modes: product requirements, UX patterns, risk analysis, architecture decisions, test plans, and sprint tasks.

Each mode required 2-4 inference calls. Total: 18-24 LLM calls per scaffold.

The old pipeline by the numbers:

  • 7 minutes end-to-end latency
  • $0.01-0.02 per scaffold (API costs)
  • ~84,000 characters of prose output across 6 modes
  • Zero reproducibility — different output every run
  • ~5% failure rate — any single call failure breaks the entire pipeline

The quality was good. The economics were not. 18-24 API calls meant as many points of failure, as many latency penalties, and costs that grew linearly with usage.
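A quick sanity check on the ~5% figure: with 18 sequential calls, even a tiny per-call failure rate compounds, because any single failure breaks the run. A sketch (the 0.3% per-call rate is an assumption for illustration, not a measured number):

```typescript
// Probability that at least one of n sequential calls fails,
// given an independent per-call failure probability p.
function pipelineFailureRate(n: number, p: number): number {
  return 1 - Math.pow(1 - p, n);
}

// With 18 calls at an assumed 0.3% per-call failure rate:
const rate = pipelineFailureRate(18, 0.003);
console.log((rate * 100).toFixed(1) + "%"); // ≈ 5.3%
```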

The Insight

We analyzed what downstream consumers actually used from the output:

Used: component names, priority rankings, framework choices, test frameworks, CI stages, task estimates, threat categories — structured key-value pairs.

Ignored: 90% of the prose. The multi-paragraph rationale, the verbose stakeholder analysis, the "this component is important because..." filler.

Users extracted structured facts from unstructured prose, then discarded the prose. We were spending 18 LLM calls to generate text that got immediately decomposed back into key-value pairs.

The question: what if the structured facts were the primary output, and prose was optional?
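The shape of that "facts first" output can be sketched as typed records rather than prose to parse. Field names and dimensions below are illustrative, not Stackbilt's actual schema:

```typescript
// Hypothetical shape of a native structured fact.
type Dimension =
  | "requirements" | "ux" | "risk"
  | "architecture" | "testing" | "tasks";

interface Fact {
  dimension: Dimension;
  key: string;
  value: string | number;
}

// Facts are the primary output; prose, if wanted, is rendered *from* them.
const facts: Fact[] = [
  { dimension: "architecture", key: "framework", value: "cloudflare-workers" },
  { dimension: "testing", key: "testFramework", value: "vitest" },
  { dimension: "tasks", key: "sprint1.estimateHours", value: 6 },
];

// Consumers query facts directly — no prose parsing step.
const frameworks = facts
  .filter((f) => f.dimension === "architecture")
  .map((f) => f.value);
```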

The Solution

We built a symbolic computation engine that generates structured project specifications without any LLM inference. Instead of asking a language model to reason about architecture, we encoded domain expertise directly into curated data structures with typed properties.

The pipeline now:

User describes a project
  → Engine evaluates description against domain knowledge
    → Generates 400-600+ structured facts across 6 dimensions
      → Materializer renders 9 deployable project files
        → Publisher creates GitHub repo (atomic commit)
          → User deploys: npm install → wrangler deploy → live Worker

Zero LLM calls in the generation step. An optional single inference call adds natural language polish for users who want prose.
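The stages above compose as plain functions: because each stage is deterministic, the whole scaffold is too. A minimal sketch (stage names, types, and the keyword matching are hypothetical stand-ins for the real engine):

```typescript
// Hypothetical two-stage pipeline: evaluate → materialize.
interface Fact { dimension: string; key: string; value: string }
interface ProjectFile { path: string; contents: string }

function evaluate(description: string, _seed: number): Fact[] {
  // Stand-in for the symbolic engine: derive facts from the description.
  const facts: Fact[] = [];
  if (/api/i.test(description)) {
    facts.push({ dimension: "architecture", key: "runtime", value: "cloudflare-workers" });
  }
  return facts;
}

function materialize(facts: Fact[]): ProjectFile[] {
  // Stand-in for the materializer: render deployable files from facts.
  return [
    { path: "wrangler.toml", contents: `name = "scaffold"` },
    { path: "src/index.ts", contents: `// stubs informed by ${facts.length} facts` },
  ];
}

const files = materialize(evaluate("a REST API for invoices", 42));
```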

Results

Side-by-side benchmark (same input, same evaluation criteria):

| Metric | LLM Pipeline | Symbolic Engine | Delta |
|---|---|---|---|
| Latency | ~7 minutes | 323 ms | 1,300x faster |
| Cost per scaffold | $0.01-0.02 | $0.00 | 95-100% cheaper |
| LLM calls | 18-24 | 0 | Zero inference |
| Structured output | 52 facts (extracted from prose) | 400-600+ facts (native) | 10x richer |
| Reproducibility | None | Seed-verified (SHA-256 receipt) | Fully verifiable |
| Composite quality | 0.54 | 0.77 | 43% higher |
| Failure rate | ~5% | <0.1% | No external dependencies |

Quality scoring: 3,000 decisions per engine (15 scenarios × 50 seeds), scored across 6 weighted metrics including diversity, coherence, acceptance, degeneracy, latency, and constraint satisfaction.
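A composite like the 0.77 above is a weighted sum over the six metrics. A sketch of how such a composite might be computed — the weights and example scores here are assumptions for illustration, not Stackbilt's actual weighting:

```typescript
// Hypothetical weighted composite over six quality metrics, each in [0, 1].
type Metrics = {
  diversity: number; coherence: number; acceptance: number;
  degeneracy: number; latency: number; constraints: number;
};

// Assumed weights; they sum to 1.0.
const WEIGHTS: Metrics = {
  diversity: 0.2, coherence: 0.2, acceptance: 0.2,
  degeneracy: 0.1, latency: 0.1, constraints: 0.2,
};

function compositeQuality(m: Metrics): number {
  return (Object.keys(WEIGHTS) as (keyof Metrics)[])
    .reduce((sum, k) => sum + WEIGHTS[k] * m[k], 0);
}

// Illustrative per-metric scores; this particular set comes out to 0.77.
const score = compositeQuality({
  diversity: 0.8, coherence: 0.85, acceptance: 0.75,
  degeneracy: 0.6, latency: 0.9, constraints: 0.7,
});
```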

What Ships

The engine generates 9 project files from a single description:

| Category | Files | What's in them |
|---|---|---|
| Governance | .ai/manifest.adf, .ai/core.adf, .ai/state.adf | Architectural constraints, product requirements, security policies, sprint backlog |
| Infrastructure | package.json, tsconfig.json, wrangler.toml | Dependencies, build config, Cloudflare Workers bindings (D1, KV, Queues inferred from description) |
| Source | src/index.ts | Entry point with handler stubs informed by requirements and security constraints |
| Tests | test/index.test.ts | Test stubs with framework, CI stage, and coverage targets |
| Documentation | README.md | Architecture overview, getting started, first task |

Governance files are generated first. The first thing a developer sees is constraints and rules, not code. This shifts the conversation from "what should I build?" to "what are the rules of this system?"
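Governance-first ordering can be enforced with a simple rank over file categories before anything is written. A sketch (the category names and ranks are illustrative):

```typescript
// Hypothetical write order: governance files always land first.
const CATEGORY_RANK: Record<string, number> = {
  governance: 0, infrastructure: 1, source: 2, tests: 3, docs: 4,
};

interface ScaffoldFile { path: string; category: string }

function writeOrder(files: ScaffoldFile[]): ScaffoldFile[] {
  return [...files].sort(
    (a, b) => CATEGORY_RANK[a.category] - CATEGORY_RANK[b.category],
  );
}

const ordered = writeOrder([
  { path: "src/index.ts", category: "source" },
  { path: ".ai/manifest.adf", category: "governance" },
  { path: "package.json", category: "infrastructure" },
]);
// .ai/manifest.adf is first — constraints before code.
```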

The E2E Pipeline

The engine is exposed as MCP tools on the Stackbilt platform. Any MCP-compatible AI assistant (Claude, Cursor, Windsurf) can run the full pipeline in a single conversation:

  1. scaffold_create — describe your project, receive structured facts + deployable files
  2. scaffold_publish — push files to a GitHub repository (atomic multi-file commit)
  3. Deploy: git clone → npm install → npx wrangler deploy

Time from description to deployed Cloudflare Worker: under 2 minutes.

We also support importing existing n8n automation workflows — paste your n8n JSON, get a transpiled Cloudflare Worker. Same pipeline, different input.

What We Learned

1. LLM calls are a liability for structured output

Every inference call introduces latency, cost, and uncontrolled variation. When your consumer needs structured data, generating prose and parsing it back is an anti-pattern. Start with structure.

2. Controlled variation beats uncontrolled variation

Our previous pipeline produced different output every run with no way to reproduce a specific result. The new engine produces unique output per project — but any output can be replayed from its seed and verified by receipt hash. The goal isn't eliminating variation; it's making variation auditable.
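Seed-plus-receipt verification can be sketched in a few lines: generation is a deterministic function of the seed, and the receipt is a hash of the output, so anyone can replay and compare. A minimal sketch using Node's crypto and a tiny PRNG (the engine's actual receipt format is not shown here; `generate` is a stand-in):

```typescript
import { createHash } from "node:crypto";

// mulberry32: a tiny deterministic PRNG — same seed, same stream.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function generate(description: string, seed: number): string {
  const rand = mulberry32(seed);
  // Stand-in for the engine: pick among variants deterministically.
  const variant = Math.floor(rand() * 3);
  return JSON.stringify({ description, variant });
}

function receipt(output: string): string {
  return createHash("sha256").update(output).digest("hex");
}

// Replaying from the same seed yields byte-identical output,
// so the receipt hashes match — variation stays auditable.
const a = generate("invoice API", 1234);
const b = generate("invoice API", 1234);
```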

3. Domain expertise as data outperforms domain expertise from training data

A curated knowledge base of Cloudflare Workers primitives with typed properties produces better architecture decisions for CF Workers projects than a general-purpose LLM drawing on training data. Encoded expertise > extracted expertise.
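"Expertise as data" means each primitive carries typed properties the engine can match against requirements deterministically. A sketch of what such an entry might look like — the fields and matching logic are illustrative, not the real schema:

```typescript
// Hypothetical knowledge-base entry for a Cloudflare Workers primitive.
interface Primitive {
  name: string;
  binding: "D1" | "KV" | "Queues" | "R2";
  goodFor: string[];               // requirement keywords it satisfies
  consistency: "strong" | "eventual";
  maxValueSizeBytes?: number;
}

const KNOWLEDGE_BASE: Primitive[] = [
  { name: "D1", binding: "D1", goodFor: ["relational", "sql", "transactions"], consistency: "strong" },
  { name: "KV", binding: "KV", goodFor: ["cache", "session", "config"], consistency: "eventual", maxValueSizeBytes: 25 * 1024 * 1024 },
];

// Match requirements against encoded properties instead of asking an LLM.
function selectStorage(requirements: string[]): Primitive | undefined {
  return KNOWLEDGE_BASE.find((p) =>
    requirements.some((r) => p.goodFor.includes(r)),
  );
}

const picked = selectStorage(["sql", "cache"]);
```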

4. Governance-first scaffolding changes behavior

When .ai/ constraint files come before src/ code files, developers start with "what are the rules?" instead of "let me start coding." This compounds across a team.

5. Prose is optional, not primary

An optional parameter adds natural language polish via a single inference call. Most API consumers skip it. The structured facts are sufficient to generate deployable code.

Numbers That Matter

| Metric | Before | After |
|---|---|---|
| Time to scaffold | 7 minutes | 323 ms |
| Cost per scaffold | $0.01-0.02 | $0.00 |
| Inference calls | 18-24 | 0 |
| Time to deployed Worker | ~15 min | ~2 min |
| Failure rate | ~5% | <0.1% |
| Output reproducibility | None | SHA-256 verified |

Try It

The scaffold pipeline is live on the Stackbilt MCP gateway:

```json
{
  "mcpServers": {
    "stackbilt": {
      "url": "https://mcp.stackbilt.dev/mcp"
    }
  }
}
```

Add that to your Claude Desktop or Cursor MCP config, then ask: "Scaffold a Cloudflare Workers API for [your project]"

Three tool calls. Description to production.


Built by Stackbilt. Questions? GitHub.

Written by Kurt Overmier & AEGIS. Published on The Roundtable.
