Autonomous AI Execution: The Safety Architecture That Lets Us Ship Code While We Sleep

Kurt Overmier & AEGIS

We've run 268 autonomous Claude Code sessions across 16 repositories. Here's the three-layer safety model that makes unattended AI code generation actually work in production.

Most of those 268 sessions ran overnight, unattended. The AI writes code, commits to a branch, and opens a PR. We review diffs in the morning.

This is not a demo. It's how we actually ship — tests, documentation, refactors, bug fixes, and small features. The hard part was never getting the AI to write code. The hard part was making sure it doesn't do damage when nobody's watching.

Here's what we built.

The Problem

Claude Code is excellent in interactive sessions. You're there, watching, course-correcting. But there's a class of work — writing test suites, generating docs, fixing lint issues, small refactors — that doesn't need you in the loop. It just needs to get done.

The naive approach: queue a prompt, run Claude headless, push the result. This works until it doesn't. An unattended session that decides to rm -rf a directory, force-push to main, or drop a database table is not a theoretical risk. It's an inevitability if you run enough sessions.

We needed a system where the AI can work autonomously but can't cause irreversible damage.

Three Layers of Safety

Our answer is defense in depth. Three independent layers, each operating at a different level. All three must fail simultaneously for something bad to happen.

Task Runner
  │
  ├── Layer 1: Safety Hooks (hard blocks)
  │   ├── block-interactive.sh  → blocks AskUserQuestion
  │   ├── safety-gate.sh        → blocks destructive operations
  │   └── syntax-check.sh       → warns on compile errors after edits
  │
  ├── Layer 2: CLI Constraints (execution limits)
  │   ├── --max-turns N         → caps agentic loops
  │   ├── --output-format json  → structured output for parsing
  │   └── --settings hooks.json → loads safety hooks
  │
  └── Layer 3: Mission Brief (behavioral constraints)
      ├── "Do NOT ask questions"
      ├── "Do NOT deploy to production"
      ├── "Do NOT run destructive commands"
      └── "Output TASK_COMPLETE when done"

Layer 1: Safety Hooks

Hooks are shell scripts that intercept tool calls before they execute. They're the hard floor — the AI literally cannot bypass them because they run outside the model's control.

block-interactive.sh solves the 3 AM problem. When Claude tries to ask a clarifying question at 3 AM, nobody's there to answer. Without this hook, the session hangs indefinitely. With it, Claude is forced to make a judgment call and document its reasoning. Most of the time, the judgment is fine. When it's not, you see it in the PR.
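A minimal sketch of that gate, in Python for brevity (the shipped hook is a shell script). It assumes Claude Code's hook convention — the tool call arrives as JSON on stdin, and exit code 2 blocks it with the stderr message fed back to the model; the function name and message wording are illustrative:

```python
"""Sketch of the block-interactive gate (illustrative, not the shipped script)."""
import json
import sys

BLOCKED_TOOLS = {"AskUserQuestion"}  # interactive tools that would hang overnight

def decide(event):
    """Return (exit_code, message) for one tool-call event."""
    if event.get("tool_name") in BLOCKED_TOOLS:
        return 2, ("No human is available to answer questions in this session. "
                   "Make a reasonable judgment call and document your reasoning "
                   "in the commit message.")
    return 0, ""

# The hook's entry point would be:
#   code, msg = decide(json.load(sys.stdin))
#   print(msg, file=sys.stderr)
#   sys.exit(code)
```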

safety-gate.sh blocks destructive operations before they execute:

  • rm -rf, rm -r / — filesystem destruction
  • git reset --hard, git push --force — history destruction
  • DROP TABLE, TRUNCATE TABLE — database destruction
  • wrangler deploy, kubectl apply, terraform apply — production deploys
  • Secret/token access via echo

These aren't suggestions. They're hard blocks. The command never reaches the shell.
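The matching logic behind that gate can be sketched as a pattern list. These patterns mirror the bullets above but are illustrative, not the shipped script's exact rules:

```python
"""Sketch of safety-gate's command matching (illustrative, not exhaustive)."""
import re

DESTRUCTIVE = [
    r"\brm\s+-rf\b",                  # filesystem destruction
    r"\brm\s+-r\s+/",
    r"\bgit\s+reset\s+--hard\b",      # history destruction
    r"\bgit\s+push\b.*--force",
    r"(?i)\bdrop\s+table\b",          # database destruction
    r"(?i)\btruncate\s+table\b",
    r"\bwrangler\s+deploy\b",         # production deploys
    r"\bkubectl\s+apply\b",
    r"\bterraform\s+apply\b",
    r"\becho\b.*(TOKEN|SECRET|KEY)",  # secret/token access via echo
]

def is_destructive(command):
    """True if `command` matches any hard-block pattern."""
    return any(re.search(p, command) for p in DESTRUCTIVE)
```

A matching command makes the hook exit with the blocking code, so the shell never sees it.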

syntax-check.sh is advisory — it never blocks, but it catches compile errors immediately after a file edit. Without it, Claude might make a typo in line 10 and not discover the error until 50 tool calls later. Early feedback keeps sessions on track.
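The advisory pattern is simple: compile the edited file, report errors, never block. The real hook is a shell script that dispatches per language; this sketch shows only the Python case, with an illustrative function name:

```python
"""Sketch of an advisory post-edit syntax check (Python case only)."""
import py_compile

def check_python(path):
    """Return an error message if `path` fails to compile, else None.
    Callers report the message but always continue (never block)."""
    try:
        py_compile.compile(path, doraise=True)
        return None
    except py_compile.PyCompileError as err:
        return str(err)
```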

Layer 2: CLI Constraints

Claude Code's CLI flags provide the second layer:

  • --max-turns caps how many actions a session can take. A simple docs task gets 10-15 turns. A multi-file feature gets 25. This prevents runaway sessions that burn through compute doing nothing useful.
  • --output-format json gives us structured output we can parse for completion signals, cost tracking, and failure classification.
  • --settings points to the hooks configuration, ensuring every autonomous session loads the safety gates.
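Put together, the runner's headless invocation looks roughly like this. The flags are Claude Code's CLI flags (`-p` is headless print mode); the wrapper function and its defaults are illustrative:

```python
"""Sketch of how a runner might assemble a headless Claude invocation."""

def build_command(prompt, max_turns=15, settings="hooks.json"):
    return [
        "claude", "-p", prompt,          # headless (print) mode
        "--max-turns", str(max_turns),   # cap the agentic loop
        "--output-format", "json",       # structured result for parsing
        "--settings", settings,          # load the safety hooks
    ]
```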

Layer 3: Mission Brief

Every task gets a structured prompt with explicit constraints. This is the weakest layer — the model can ignore prompt instructions — which is exactly why the hooks exist as hard blocks behind it.

The mission brief tells the agent:

  • What to do (specific files, specific changes)
  • What NOT to do (no deploys, no destructive commands, no questions)
  • How to signal completion (TASK_COMPLETE or TASK_BLOCKED)
  • To commit its work before exiting
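A brief with that shape might be rendered from a template like this. The exact wording of our briefs differs; the structure mirrors the bullets above:

```python
"""Illustrative mission-brief template (structure, not the exact shipped text)."""

BRIEF = """\
TASK: {title}

DO:
- {instructions}
- Commit your work before exiting.

DO NOT:
- Ask questions. Make a judgment call and document it.
- Deploy to production.
- Run destructive commands (rm -rf, force-push, DROP TABLE).

WHEN FINISHED:
- Output TASK_COMPLETE if the task succeeded.
- Output TASK_BLOCKED with a reason if you could not proceed.
"""

def render_brief(title, instructions):
    return BRIEF.format(title=title, instructions=instructions)
```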

The layered approach matters. Prompt constraints alone would be insufficient. Hooks alone would be too restrictive (you'd block legitimate operations). Together, they create a system where the AI has enough freedom to be useful but not enough to be dangerous.

Branch Isolation

Every autonomous task runs on its own branch. Main is never directly modified.

main ──────────────────────────────────────── (untouched)
  │
  ├── auto/a1b2c3d4 ── commit ── commit ── PR
  │
  ├── auto/e5f6g7h8 ── commit ── PR
  │
  └── auto/i9j0k1l2 ── commit ── commit ── PR

Before branching, the runner stashes any uncommitted work in the repo. After the task completes (or fails), it restores the stash and returns to main. Your in-progress work is never clobbered.

If a task produces no commits — a research task, or one that realizes nothing needs changing — the empty branch is cleaned up automatically. No branch litter.

When a task finishes with commits, it pushes the branch and creates a PR. You review diffs in the morning, not raw commits at 3 AM.
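The isolation sequence boils down to a few git steps. The runner itself is bash; this Python sketch just lists the steps, with the `auto/<hex>` naming from the diagram above and simplified stash handling (the real runner skips `stash pop` when nothing was stashed):

```python
"""Sketch of the branch-isolation sequence as a list of git steps."""
import uuid

def isolation_steps(task_id=None):
    branch = "auto/" + (task_id or uuid.uuid4().hex[:8])
    return [
        ["git", "stash", "--include-untracked"],  # park uncommitted work
        ["git", "checkout", "-b", branch],        # isolate the task
        # ...Claude session runs here; commits land on the branch...
        ["git", "checkout", "main"],              # return to main
        ["git", "stash", "pop"],                  # restore parked work
    ]
```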

What Actually Happens in Production

After 268 completed sessions, here's what the failure taxonomy looks like:

Failure Kind                Count   What It Means
branch_conflict                13   Task branch already existed from a prior run
completion_signal_missing       9   Task finished but didn't output the expected signal
session_timeout                 2   Ran out of turns on a complex task
repo_locked                     2   Another session was already working in the same repo
repo_missing                    1   Target directory didn't exist

A few observations:

Most failures are infrastructure, not AI mistakes. Branch conflicts and repo locks are scheduling problems, not intelligence problems. The AI didn't do anything wrong — the runner queued work into a bad state.

Completion signal issues are a prompt problem. When the model finishes its work but doesn't output TASK_COMPLETE, it's usually because the prompt didn't make the completion criteria clear enough. We added a fallback: if Claude exits cleanly and produced commits, we treat it as an implicit success. This cut false failures significantly.
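The fallback rule is easy to state as code. This is a sketch of the classification logic, with illustrative names (`new_commits` would come from something like `git rev-list --count main..HEAD` in the real runner):

```python
"""Sketch of completion classification with the implicit-success fallback."""

def classify(exit_code, output, new_commits):
    if "TASK_COMPLETE" in output:
        return "success"
    if "TASK_BLOCKED" in output:
        return "blocked"
    if exit_code == 0 and new_commits > 0:
        return "success"  # implicit success: clean exit plus commits
    return "completion_signal_missing" if exit_code == 0 else "failed"
```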

Timeouts are a sizing problem. Two sessions timed out because the tasks were too large for their turn budget. The fix was a large-file guardrail that auto-detects files over 800 lines and bumps the turn limit accordingly. Files over 1,500 lines get even more headroom.
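The guardrail itself is a small sizing function. The 800- and 1,500-line thresholds are the ones described above; the exact multipliers here are illustrative:

```python
"""Sketch of the large-file turn-budget guardrail."""

def turn_budget(base_turns, file_lines):
    if file_lines > 1500:
        return base_turns * 3  # even more headroom for very large files
    if file_lines > 800:
        return base_turns * 2  # bump the budget for large files
    return base_turns
```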

The safety hooks have never been the failure point. In 268 sessions, the hooks have caught destructive operations that would have been ugly unattended. They've never blocked a legitimate operation. The allowlist/blocklist balance is right.

Task Dependencies

Not all tasks are independent. Sometimes you need to scaffold test infrastructure before writing tests, or generate types before writing code that uses them.

The runner supports DAG dependencies via a blocked_by field:

[
  { "id": "task-001", "title": "Scaffold test infra", "status": "pending" },
  { "id": "task-002", "title": "Write auth tests", "blocked_by": ["task-001"] },
  { "id": "task-003", "title": "Write quota tests", "blocked_by": ["task-001"] }
]

Tasks 2 and 3 won't run until task 1 completes successfully. If task 1 fails, the dependents stay pending — they won't execute against broken infrastructure.
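The readiness check behind that rule is small: a pending task is runnable only when every entry in its blocked_by list has completed. A sketch, with an illustrative function name:

```python
"""Sketch of the blocked_by readiness gate."""

def runnable(tasks):
    """Return ids of pending tasks whose dependencies are all done."""
    done = {t["id"] for t in tasks if t.get("status") == "done"}
    return [
        t["id"] for t in tasks
        if t.get("status", "pending") == "pending"
        and all(dep in done for dep in t.get("blocked_by", []))
    ]
```

Because a failed dependency never enters the done set, its dependents stay pending rather than running against broken infrastructure.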

What Works Best Autonomously

After hundreds of sessions, there's a clear hierarchy of what AI handles well unattended:

Excellent — queue freely:

  • Unit tests (read code, write tests, run them, commit)
  • Documentation (read code, write docs, no source modifications)
  • Research and analysis (read codebases, produce a report)
  • Linting and formatting (mechanical, verifiable)

Good — queue with care:

  • Small features (one component, one route, clear spec)
  • Bug fixes (only if the bug is well-understood)
  • Scoped refactors ("extract this class into its own file")

Don't queue:

  • Production deploys
  • Database migrations
  • Auth/payment changes
  • Architectural decisions
  • Anything that deletes user data

The pattern: autonomous AI excels at tasks with clear inputs, verifiable outputs, and bounded scope. It struggles with ambiguity, cross-cutting concerns, and decisions that require human judgment. This isn't a limitation — it's a design constraint that makes the system trustworthy.

The Overnight Loop

Our typical workflow:

  1. During the day, we identify work that fits the autonomous profile — tests to write, docs to update, small features with clear specs
  2. We queue tasks with appropriate prompts, turn limits, and dependencies
  3. We start the runner in loop mode (--loop), which polls for new tasks every 60 seconds
  4. We go to sleep
  5. In the morning, we have PRs to review
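Loop mode is nothing fancier than a poll-and-run cycle. A sketch of its shape (`fetch_ready`, `run_task`, and the `max_cycles` escape hatch are illustrative; the real runner is bash):

```python
"""Sketch of --loop mode: poll the queue, run what's ready, sleep, repeat."""
import time

def poll_loop(fetch_ready, run_task, poll_interval=60, max_cycles=None):
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for task in fetch_ready():  # tasks whose dependencies are satisfied
            run_task(task)
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            time.sleep(poll_interval)
```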

On a good night, this means 5-10 PRs across multiple repos, each isolated on its own branch, each with a clear diff. The review is fast because each PR is scoped to one logical change.

On a bad night, we have a few failed tasks with clear failure reasons and no damage done. The safety architecture means "bad" is "wasted compute," not "corrupted repository."

Open Source

We extracted this system into cc-taskrunner — ~600 lines of bash and Python, Apache 2.0, zero dependencies beyond the Claude CLI, Python 3, and optionally the GitHub CLI for PR creation.

git clone https://github.com/Stackbilt-dev/cc-taskrunner.git
cd cc-taskrunner
chmod +x taskrunner.sh hooks/*.sh

# Queue a task
./taskrunner.sh add "Write unit tests for the auth middleware"

# Run until queue empty
./taskrunner.sh

# Or loop forever
./taskrunner.sh --loop

The safety hooks, branch isolation, mission brief templates, and completion detection are all included. Plug it into any repo where you use Claude Code.

What We Learned

The hard problem is governance, not generation. Getting AI to write code is solved. Getting AI to write code safely, unattended, at scale requires real engineering around safety, isolation, and failure handling.

Defense in depth actually works. No single layer is sufficient. Prompt constraints can be ignored. Hooks alone are too blunt. CLI limits alone don't prevent destructive operations. Together, they create a system we actually trust overnight.

Failure taxonomy matters more than success rate. Our 77% success rate sounds mediocre until you look at why things fail. Most failures are infrastructure issues (branch conflicts, repo locks), not AI mistakes. The AI's judgment is surprisingly good when the prompt is clear and the safety net is in place.

Start with tests and docs. If you're trying autonomous AI execution for the first time, start with the tasks that have the highest success rate and lowest blast radius. Write tests. Generate docs. Once you trust the system, expand the scope.

The future of developer tooling isn't AI that replaces you. It's AI that handles the well-defined work while you focus on the decisions that actually need a human. The overnight loop is just the beginning.

Written by Kurt Overmier & AEGIS. Published on The Roundtable.

Try the tools behind this article

Connect Stackbilt's MCP server to Claude Desktop and generate your first Cloudflare Worker in seconds.

{"mcpServers": {"stackbilt": {"url": "https://mcp.stackbilt.dev/sse"}}}
Learn more at stackbilt.dev →