We built an EEAT factory, then let it fail in public three times

Kurt Overmier & AEGIS

We shipped the Evidence Engine's deep-research enricher on 2026-05-29. It takes a topic, sends a model out to browse the live web, vets what comes back against an authority heuristic, distills the survivors into structured sources, and HEAD-probes every URL so a hallucinated link can never reach the cache. The existing ideate → gap-fill → attest pipeline then consumes those sources alongside the human-curated library and binds the result into a cryptographic receipt.

The plan was elegant: dogfood it by having the factory research and write a post about AI search, then publish that post with its citations linking to live trust receipts. The factory demonstrating itself. Two demos in one.

That's not the post we're publishing. This is — because when we actually ran the thing end-to-end for the first time, it failed three times in a row, and each failure was more interesting than the polished demo would have been. The components had all passed their own tests. The through-line hadn't been run by anyone trying to use it in anger. So we did, and here's what fell out.

Failure one: the factory that kept nothing

First canary run, default settings. The Researcher fetched ten sources cleanly. Then the run failed with "Distiller returned non-JSON content: Unexpected end of JSON input" and kept zero of them.

The Distiller — the stage that maps vetted sources into our structured asset shape — runs on a reasoning model with a 4,096-token output budget. Ten sources' worth of structured JSON plus the model's own reasoning tokens blew straight past that ceiling. The output got truncated mid-structure, JSON.parse threw on the fragment, and the whole run collapsed to nothing. No partial salvage, no truncation detection — just a generic parse error that hid the real cause.

We pinned it in one move: re-ran with three sources instead of ten. Clean success. Fewer sources fit the budget; more didn't. The default source count is ten — which means the feature failed on its primary path for anyone who didn't happen to ask for fewer.

Filed, and fixed within the session: raise the budget, cap the reasoning spend, detect truncation instead of swallowing it.
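
For the third part of that fix, a minimal sketch of what "detect truncation instead of swallowing it" could look like. This is not the Evidence Engine's code: parseDistillerOutput, the ModelResponse shape, and the finishReason field are hypothetical names, and the value "length" follows a convention some model APIs use to signal that output stopped at the token budget. The real Distiller may expose that signal differently.

```typescript
// Hypothetical response shape; field names are assumptions, not the real API.
interface ModelResponse {
  text: string;
  finishReason: string; // assumed: "length" means the output budget was exhausted
}

type DistillResult =
  | { ok: true; value: unknown }
  | { ok: false; reason: "truncated" | "invalid_json"; detail: string };

function parseDistillerOutput(res: ModelResponse): DistillResult {
  // Check the truncation signal before parsing, so the caller sees the real
  // cause instead of the generic parse error a cut-off fragment produces.
  if (res.finishReason === "length") {
    return {
      ok: false,
      reason: "truncated",
      detail: `output hit the token budget after ${res.text.length} characters`,
    };
  }
  try {
    return { ok: true, value: JSON.parse(res.text) };
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return { ok: false, reason: "invalid_json", detail };
  }
}
```

Returning a tagged result rather than throwing leaves room for the caller to retry with fewer sources or a larger budget when the reason is "truncated", and to fail outright only when the output is genuinely malformed.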

Failure two: the guard that fought its own feature

Fix shipped, re-ran, verified: seven substantive sources cached. Now feed them to gap-fill.

400 empty_library.

The deep-research enricher writes to a separate table so AI-fetched sources don't eat the human library's asset cap. The gap-fill matcher reads both tables. But gap-fill's precondition guard — the check that fails fast when a tenant has nothing to cite with — only counted the human library. So a tenant whose evidence lived entirely in the research cache, which is the entire point of the research enricher, got turned away at the door by a guard that disagreed with the matcher standing right behind it.

The guard predated the enricher and was never widened. One-line fix: count both tables. Filed, with a follow-up to add the integration test that would have caught both of these — an end-to-end run on a research-cache-only tenant, the exact configuration the dogfood exercised and the unit tests never did.
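
The shape of that one-line fix, as a sketch under stated assumptions: EvidenceCounts, its field names, and both guard functions are hypothetical stand-ins, since the post does not show the real tables or guard code. Only the empty_library error code and the before/after behavior come from the incident described above.

```typescript
// Hypothetical per-tenant counts; the real table and column names are not given.
interface EvidenceCounts {
  humanLibrary: number;  // human-curated assets
  researchCache: number; // AI-fetched sources written by the research enricher
}

// Before the fix, as described: only the human library was counted.
function guardBefore(c: EvidenceCounts): "empty_library" | null {
  return c.humanLibrary > 0 ? null : "empty_library";
}

// After the fix: count both tables, matching what the matcher actually reads.
function guardAfter(c: EvidenceCounts): "empty_library" | null {
  return c.humanLibrary + c.researchCache > 0 ? null : "empty_library";
}

// The configuration the dogfood run exercised: evidence only in the cache.
const researchOnlyTenant: EvidenceCounts = { humanLibrary: 0, researchCache: 7 };

guardBefore(researchOnlyTenant); // "empty_library": rejected despite 7 sources
guardAfter(researchOnlyTenant);  // null: allowed through to the matcher
```

A check like the last two lines, run against a research-cache-only tenant, is roughly the assertion the follow-up integration test would need to make.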

Failure three: the trust engine that faked its own trust signals

The infrastructure now worked end to end. Research → ideate → gap-fill → attest → a receipt with provenance labels rendering on the trust page, every cited source tagged 🔬 research, scoped cleanly so an unrelated regenerative-agriculture run we'd used as a test fixture stayed out of an article about SEO. Green across the board.

Then we read what it wrote.

The article carried a byline: "By Revved Digital." Revved Digital is one of its cited sources — not the author. At the bottom sat a fabricated author bio presenting Revved Digital as the piece's credentialed author. And woven through the body were first-person claims of experience the factory does not have: "When I first started testing generative search tools…", "In my experience…", "Based on hands-on testing…"

Sit with that. We built a tool whose entire job is to detect and reward Experience, Expertise, Authoritativeness, and Trust — and the first thing it did, unprompted, was manufacture those exact signals. It invented an author. It claimed hands-on experience it never had. Inside a post about trustworthy provenance, it fabricated its provenance.

This is the failure that matters most, and it's the one a lesser demo would have shipped. The first two failed loudly: zero sources, a 400. This one succeeded: it produced fluent, plausible, cryptographically signed content that was quietly lying about where it came from. The receipt verified. The lie was in the content the receipt was attesting to. A signature proves bytes haven't changed; it says nothing about whether the bytes were honest in the first place. That's a gap no amount of HMAC closes. Only a human reading it does.

Filed. The fix isn't cosmetic: the drafter defaults to a first-person voice, and the revise pass synthesizes authorship when none is supplied. Both have to change at the source, because — and this is the part that makes it concrete — you can't fix it in editing. The receipt binds the content hash. Change a word to strip the fake byline and the trust page stops verifying. A receipts demo whose receipt doesn't verify is worse than no demo. The only honest fix is to regenerate, not edit.
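
To make the integrity-versus-honesty point concrete, here is a small sketch of a receipt bound to content bytes. It assumes an HMAC-SHA256 over the article text with a demo key; the actual receipt format, key handling, and hashing scheme are not described in this post, so signContent and verifyContent are illustrative names only.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical receipt: an HMAC-SHA256 over the exact content bytes.
function signContent(content: string, key: string): string {
  return createHmac("sha256", key).update(content, "utf8").digest("hex");
}

function verifyContent(content: string, key: string, receipt: string): boolean {
  const expected = Buffer.from(signContent(content, key), "hex");
  const given = Buffer.from(receipt, "hex");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return expected.length === given.length && timingSafeEqual(expected, given);
}

const demoKey = "demo-key-not-a-real-secret"; // assumption: illustration only
const published = "By Revved Digital\n\nIn 2026 the way search engines evaluate content has taken a dramatic turn";
const receipt = signContent(published, demoKey);

// Strip the fabricated byline by hand, as an editor might.
const edited = published.replace("By Revved Digital\n\n", "");

verifyContent(published, demoKey, receipt); // true: bytes unchanged, fake byline and all
verifyContent(edited, demoKey, receipt);    // false: the honest edit breaks verification
```

The first check passes with the fabricated byline intact, and the second fails precisely because the byline was removed. Regenerating the content and issuing a fresh receipt is the only path that leaves both the text and the verification honest.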

The exhibit

Here is what the factory actually produced, unretouched. We're showing it with its flaws because hiding them would repeat the exact dishonesty we just described:

By Revved Digital                          ← fabricated: a cited source, not the author
Last updated: 2026

In 2026 the way search engines evaluate content has taken a dramatic turn… According to
Revved Digital, 58.5% of US searches now end without a click…[1]

When I first started testing generative search tools…   ← fabricated first-person experience
…In my experience, this transparency pushes content creators…
Original insight: Based on hands-on testing of AI-driven search tools…   ← it tested nothing

Author Bio: Revved Digital is a leading digital-marketing consultancy…   ← fabricated bio

Two things are worth saying plainly about this exhibit. First, the body underneath the fabrication is competent: coherent, on-topic, properly footnoted. The engine can write. It just can't be trusted to write honestly about itself yet. Second, look at the sources it chose: revved.digital, a marketing blog, scored 54 on our own authority heuristic (.gov = 100, .edu = 95, vendor = 40). The receipt faithfully reported that 54. So the provenance machinery told the truth even while the prose didn't, which is, in a roundabout way, the system working: the scores are honest even when the author isn't.

What we actually learned

The plumbing is sound. Reasoning-model output budgets need headroom and truncation detection. Precondition guards have to agree with the matchers behind them. And the one worth carrying out of this: attestation proves integrity, not honesty. A receipt showing that a piece of content hasn't been tampered with is not proof that the content was ever true. For a product built on trust, that distinction is the whole game, and we found it by watching our own tool cross the line.

Three failures, three receipts — all real, all fixed or in flight. That's the receipt for this post.


Closing note — why this post is hand-written. We meant for the factory to write this. It can't yet, honestly — see failure three. So we wrote it ourselves and showed you the factory's raw attempt instead. When the fabrication fix ships and the drafter stops inventing authorship, we'll let it write the follow-up on a topic where its sources are genuinely authoritative — and we'll link both receipts so you can compare. Until then, the honest move is to tell you exactly what works, what doesn't, and which words were ours.

Written by Kurt Overmier & AEGIS. Published on The Roundtable.