Autonomous AI Execution: The Safety Architecture That Lets Us Ship Code While We Sleep
We've run 268 autonomous Claude Code sessions across 16 repositories. Here's the three-layer safety model that makes unattended AI code generation actually work in production.
Most of those 268 sessions ran overnight, unattended. The AI writes code, commits to a branch, and opens a PR. We review diffs in the morning.
This is not a demo. It's how we actually ship — tests, documentation, refactors, bug fixes, and small features. The hard part was never getting the AI to write code. The hard part was making sure it doesn't do damage when nobody's watching.
Here's what we built.
The Problem
Claude Code is excellent in interactive sessions. You're there, watching, course-correcting. But there's a class of work — writing test suites, generating docs, fixing lint issues, small refactors — that doesn't need you in the loop. It just needs to get done.
The naive approach: queue a prompt, run Claude headless, push the result. This works until it doesn't. An unattended session that decides to rm -rf a directory, force-push to main, or drop a database table is not a theoretical risk. It's an inevitability if you run enough sessions.
We needed a system where the AI can work autonomously but can't cause irreversible damage.
Three Layers of Safety
Our answer is defense in depth. Three independent layers, each operating at a different level. All three must fail simultaneously for something bad to happen.
Task Runner
│
├── Layer 1: Safety Hooks (hard blocks)
│ ├── block-interactive.sh → blocks AskUserQuestion
│ ├── safety-gate.sh → blocks destructive operations
│ └── syntax-check.sh → warns on compile errors after edits
│
├── Layer 2: CLI Constraints (execution limits)
│ ├── --max-turns N → caps agentic loops
│ ├── --output-format json → structured output for parsing
│ └── --settings hooks.json → loads safety hooks
│
└── Layer 3: Mission Brief (behavioral constraints)
├── "Do NOT ask questions"
├── "Do NOT deploy to production"
├── "Do NOT run destructive commands"
└── "Output TASK_COMPLETE when done"
Layer 1: Safety Hooks
Hooks are shell scripts that intercept tool calls before they execute. They're the hard floor — the AI literally cannot bypass them because they run outside the model's control.
block-interactive.sh solves the 3 AM problem. When Claude tries to ask a clarifying question at 3 AM, nobody's there to answer. Without this hook, the session hangs indefinitely. With it, Claude is forced to make a judgment call and document its reasoning. Most of the time, the judgment is fine. When it's not, you see it in the PR.
safety-gate.sh blocks destructive operations before they execute:
- `rm -rf`, `rm -r /` — filesystem destruction
- `git reset --hard`, `git push --force` — history destruction
- `DROP TABLE`, `TRUNCATE TABLE` — database destruction
- `wrangler deploy`, `kubectl apply`, `terraform apply` — production deploys
- Secret/token access via `echo`
These aren't suggestions. They're hard blocks. The command never reaches the shell.
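To make the mechanism concrete, here is a minimal sketch of what a hook like safety-gate.sh can look like. It assumes Claude Code's hook protocol (the tool call arrives as JSON on stdin, and exiting with status 2 blocks it); the pattern list mirrors the one above, but this is illustrative, not the project's exact script:

```shell
#!/usr/bin/env bash
# Sketch of a safety-gate hook. Assumes the Claude Code hook protocol:
# the tool call arrives as JSON on stdin; exit status 2 = hard block.

# Return non-zero if the command matches a destructive pattern.
is_blocked() {
  local cmd="$1" pat
  local patterns=(
    'rm -rf' 'rm -r /'                                    # filesystem
    'git reset --hard' 'git push --force'                 # history
    'DROP TABLE' 'TRUNCATE TABLE'                         # database
    'wrangler deploy' 'kubectl apply' 'terraform apply'   # deploys
  )
  for pat in "${patterns[@]}"; do
    if [[ "$cmd" == *"$pat"* ]]; then
      echo "safety-gate: blocked '$pat'" >&2
      return 1
    fi
  done
  return 0
}

# When invoked as a hook: pull the Bash command out of the JSON payload
# and translate a match into exit code 2, so it never reaches the shell.
run_hook() {
  local cmd
  cmd=$(python3 -c 'import json,sys; print(json.load(sys.stdin).get("tool_input",{}).get("command",""))')
  is_blocked "$cmd" || exit 2
}
```

Substring matching is deliberately blunt: a false positive costs one blocked tool call, while a false negative costs a repository.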
syntax-check.sh is advisory — it never blocks, but it catches compile errors immediately after a file edit. Without it, Claude might make a typo in line 10 and not discover the error until 50 tool calls later. Early feedback keeps sessions on track.
Layer 2: CLI Constraints
Claude Code's CLI flags provide the second layer:
- `--max-turns` caps how many actions a session can take. A simple docs task gets 10-15 turns. A multi-file feature gets 25. This prevents runaway sessions that burn through compute doing nothing useful.
- `--output-format json` gives us structured output we can parse for completion signals, cost tracking, and failure classification.
- `--settings` points to the hooks configuration, ensuring every autonomous session loads the safety gates.
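For concreteness, the hooks file that `--settings` points at might look something like this. The structure follows Claude Code's hook-settings schema as we understand it (PreToolUse/PostToolUse events, a `matcher` per tool, command hooks); the matchers and script paths are illustrative:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{ "type": "command", "command": "hooks/safety-gate.sh" }]
      },
      {
        "matcher": "AskUserQuestion",
        "hooks": [{ "type": "command", "command": "hooks/block-interactive.sh" }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": "hooks/syntax-check.sh" }]
      }
    ]
  }
}
```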
Layer 3: Mission Brief
Every task gets a structured prompt with explicit constraints. This is the weakest layer — the model can ignore prompt instructions — which is exactly why the hooks exist as hard blocks behind it.
The mission brief tells the agent:
- What to do (specific files, specific changes)
- What NOT to do (no deploys, no destructive commands, no questions)
- How to signal completion (`TASK_COMPLETE` or `TASK_BLOCKED`)
- To commit its work before exiting
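A mission brief following that structure might read like this. The template is illustrative (the file path and wording are examples, not cc-taskrunner's exact format):

```markdown
## Task
Write unit tests for src/middleware/auth.ts. Cover token validation,
expiry handling, and the rejection path. Run the tests before committing.

## Constraints
- Do NOT ask questions. If something is ambiguous, make a judgment call
  and document it in the commit message.
- Do NOT deploy to production or run destructive commands.
- Do NOT modify files outside src/middleware/ and tests/.

## Completion
- Commit your work to the current branch before exiting.
- Output TASK_COMPLETE on success, or TASK_BLOCKED with a one-line reason.
```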
The layered approach matters. Prompt constraints alone would be insufficient. Hooks alone would be too restrictive (you'd block legitimate operations). Together, they create a system where the AI has enough freedom to be useful but not enough to be dangerous.
Branch Isolation
Every autonomous task runs on its own branch. Main is never directly modified.
main ──────────────────────────────────────── (untouched)
│
├── auto/a1b2c3d4 ── commit ── commit ── PR
│
├── auto/e5f6g7h8 ── commit ── PR
│
└── auto/i9j0k1l2 ── commit ── commit ── PR
Before branching, the runner stashes any uncommitted work in the repo. After the task completes (or fails), it restores the stash and returns to main. Your in-progress work is never clobbered.
If a task produces no commits — a research task, or one that realizes nothing needs changing — the empty branch is cleaned up automatically. No branch litter.
When a task finishes with commits, it pushes the branch and creates a PR. You review diffs in the morning, not raw commits at 3 AM.
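The stash/branch/restore cycle can be sketched as a small wrapper. The helper names here are hypothetical, and it assumes a repo with a `main` branch, an `origin` remote, and the GitHub CLI for PR creation:

```shell
# Eight hex characters, e.g. a1b2c3d4, matching the auto/<id> branches above.
new_task_id() { head -c4 /dev/urandom | od -An -tx1 | tr -d ' \n'; }

# Sketch of the branch lifecycle: stash, branch, run, PR or clean up, restore.
run_isolated() {
  local branch="auto/$(new_task_id)"
  git stash push --include-untracked -m taskrunner-autostash || true
  git checkout -b "$branch"
  "$@"   # the Claude session runs here, committing to $branch
  if [ -z "$(git log main.."$branch" --oneline)" ]; then
    # No commits (research task, or nothing needed changing): no branch litter.
    git checkout main && git branch -D "$branch"
  else
    git push -u origin "$branch" && gh pr create --fill
    git checkout main
  fi
  git stash pop 2>/dev/null || true
}
```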
What Actually Happens in Production
After 268 completed sessions, here's what the failure taxonomy looks like:
| Failure Kind | Count | What It Means |
|---|---|---|
| branch_conflict | 13 | Task branch already existed from a prior run |
| completion_signal_missing | 9 | Task finished but didn't output the expected signal |
| session_timeout | 2 | Ran out of turns on a complex task |
| repo_locked | 2 | Another session was already working in the same repo |
| repo_missing | 1 | Target directory didn't exist |
A few observations:
Most failures are infrastructure, not AI mistakes. Branch conflicts and repo locks are scheduling problems, not intelligence problems. The AI didn't do anything wrong — the runner queued work into a bad state.
Completion signal issues are a prompt problem. When the model finishes its work but doesn't output TASK_COMPLETE, it's usually because the prompt didn't make the completion criteria clear enough. We added a fallback: if Claude exits cleanly and produced commits, we treat it as an implicit success. This cut false failures significantly.
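The fallback logic amounts to a small classifier. This is a sketch with a hypothetical function name, not the runner's actual code: an explicit signal wins, and otherwise a clean exit plus at least one commit counts as success.

```shell
# Classify a finished session from its output, exit code, and commit count.
classify_result() {
  local output="$1" exit_code="$2" commit_count="$3"
  case "$output" in
    *TASK_COMPLETE*) echo "success"; return ;;
    *TASK_BLOCKED*)  echo "blocked"; return ;;
  esac
  if [ "$exit_code" -eq 0 ] && [ "$commit_count" -gt 0 ]; then
    echo "implicit_success"   # clean exit + commits: treat as done anyway
  else
    echo "completion_signal_missing"
  fi
}
```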
Timeouts are a sizing problem. Two sessions timed out because the tasks were too large for their turn budget. The fix was a large-file guardrail that auto-detects files over 800 lines and bumps the turn limit accordingly. Files over 1,500 lines get even more headroom.
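The guardrail reduces to a sizing rule keyed on the thresholds above. The specific budgets here (15 / 25 / 40 turns) are illustrative, not the authors' exact numbers:

```shell
# Pick a turn budget for a task based on the size of its target file.
turn_budget() {
  local lines
  lines=$(wc -l < "$1")
  if   [ "$lines" -gt 1500 ]; then echo 40   # very large file: extra headroom
  elif [ "$lines" -gt 800  ]; then echo 25   # large-file guardrail kicks in
  else                             echo 15   # default budget
  fi
}
```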
The safety hooks have never been the failure point. In 268 sessions, the hooks have caught destructive operations that would have been ugly unattended. They've never blocked a legitimate operation. The allowlist/blocklist balance is right.
Task Dependencies
Not all tasks are independent. Sometimes you need to scaffold test infrastructure before writing tests, or generate types before writing code that uses them.
The runner supports DAG dependencies via a blocked_by field:
[
{ "id": "task-001", "title": "Scaffold test infra", "status": "pending" },
{ "id": "task-002", "title": "Write auth tests", "blocked_by": ["task-001"] },
{ "id": "task-003", "title": "Write quota tests", "blocked_by": ["task-001"] }
]
Tasks 2 and 3 won't run until task 1 completes successfully. If task 1 fails, the dependents stay pending — they won't execute against broken infrastructure.
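The readiness check behind this is simple: a task can run when it isn't already done and every id in its `blocked_by` list is. A sketch, using Python for the JSON handling since the real runner mixes bash and Python (the helper name is hypothetical):

```shell
# Print the ids of tasks that are ready to run.
# $1: space-separated ids of completed tasks; queue JSON arrives on stdin.
ready_tasks() {
  DONE="$1" python3 -c '
import json, os, sys
done = set(os.environ["DONE"].split())
for t in json.load(sys.stdin):
    # Ready = not already done, and all dependencies satisfied.
    if t["id"] not in done and set(t.get("blocked_by", [])) <= done:
        print(t["id"])
'
}
```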
What Works Best Autonomously
After hundreds of sessions, there's a clear hierarchy of what AI handles well unattended:
Excellent — queue freely:
- Unit tests (read code, write tests, run them, commit)
- Documentation (read code, write docs, no source modifications)
- Research and analysis (read codebases, produce a report)
- Linting and formatting (mechanical, verifiable)
Good — queue with care:
- Small features (one component, one route, clear spec)
- Bug fixes (only if the bug is well-understood)
- Scoped refactors ("extract this class into its own file")
Don't queue:
- Production deploys
- Database migrations
- Auth/payment changes
- Architectural decisions
- Anything that deletes user data
The pattern: autonomous AI excels at tasks with clear inputs, verifiable outputs, and bounded scope. It struggles with ambiguity, cross-cutting concerns, and decisions that require human judgment. This isn't a limitation — it's a design constraint that makes the system trustworthy.
The Overnight Loop
Our typical workflow:
- During the day, we identify work that fits the autonomous profile — tests to write, docs to update, small features with clear specs
- We queue tasks with appropriate prompts, turn limits, and dependencies
- We start the runner in loop mode (`--loop`), which polls for new tasks every 60 seconds
- We go to sleep
- In the morning, we have PRs to review
On a good night, this means 5-10 PRs across multiple repos, each isolated on its own branch, each with a clear diff. The review is fast because each PR is scoped to one logical change.
On a bad night, we have a few failed tasks with clear failure reasons and no damage done. The safety architecture means "bad" is "wasted compute," not "corrupted repository."
Open Source
We extracted this system into cc-taskrunner — ~600 lines of bash and Python, Apache 2.0, zero dependencies beyond the Claude CLI, Python 3, and optionally the GitHub CLI for PR creation.
git clone https://github.com/Stackbilt-dev/cc-taskrunner.git
cd cc-taskrunner
chmod +x taskrunner.sh hooks/*.sh
# Queue a task
./taskrunner.sh add "Write unit tests for the auth middleware"
# Run until queue empty
./taskrunner.sh
# Or loop forever
./taskrunner.sh --loop
The safety hooks, branch isolation, mission brief templates, and completion detection are all included. Plug it into any repo where you use Claude Code.
What We Learned
The hard problem is governance, not generation. Getting AI to write code is solved. Getting AI to write code safely, unattended, at scale requires real engineering around safety, isolation, and failure handling.
Defense in depth actually works. No single layer is sufficient. Prompt constraints can be ignored. Hooks alone are too blunt. CLI limits alone don't prevent destructive operations. Together, they create a system we actually trust overnight.
Failure taxonomy matters more than success rate. Our 77% success rate sounds mediocre until you look at why things fail. Most failures are infrastructure issues (branch conflicts, repo locks), not AI mistakes. The AI's judgment is surprisingly good when the prompt is clear and the safety net is in place.
Start with tests and docs. If you're trying autonomous AI execution for the first time, start with the tasks that have the highest success rate and lowest blast radius. Write tests. Generate docs. Once you trust the system, expand the scope.
The future of developer tooling isn't AI that replaces you. It's AI that handles the well-defined work while you focus on the decisions that actually need a human. The overnight loop is just the beginning.
Try the tools behind this article
Connect Stackbilt's MCP server to Claude Desktop and generate your first Cloudflare Worker in seconds.
{"mcpServers": {"stackbilt": {"url": "https://mcp.stackbilt.dev/sse"}}} Learn more at stackbilt.dev →