Every AI company is selling you the same dream: describe what you want, go to sleep, wake up to working code. Cursor runs agents in cloud VMs. Claude Code ships with background execution. The pitch writes itself.
Fortune ran a piece last month calling the reality "far messier" than the promise. OpenClaw famously deleted a user's entire inbox because it ignored a pause instruction. And if you've spent any time with these tools on a real codebase — not a todo app, a real codebase — you already know the gap between demo and production is wide enough to drive a truck through.
So when I decided to take one of our own projects from zero test coverage to full API testing using AI agents, I wasn't chasing the dream. I was testing a hypothesis: what if the problem isn't the agents, but the architecture around them?
Here's what happened.
The Project
Bowling Alleys IO (BAIO): a Next.js + Express.js + Firebase monorepo with 40+ API endpoints and exactly zero tests. The kind of codebase where one bad merge breaks three features and nobody knows until a user reports it. I needed comprehensive test coverage, and I needed it fast.
The traditional path: block out 2–3 days, grind through endpoint after endpoint. The path I took: plan the work in structured tickets, hand them to an AI agent chain, and let it execute while I worked on other things.
Why Single-Agent Autonomy Doesn't Work
Here's where most people go wrong. They fire up a coding agent, paste in a vague prompt like "write tests for the API," and expect magic. What they get is an agent that writes tests against imaginary endpoints, imports from wrong paths, builds mock structures that don't match the actual database schema, and confidently tells you everything passed.
The problem compounds. A bad mock helper in file one means every subsequent test file inherits the same broken assumptions. By the time the agent finishes, you've got 200 lines of tests that look great and test nothing real.
Single-agent autonomy fails on real codebases because real codebases have context — conventions, file structures, dependency patterns, database schemas — that can't be captured in a single prompt. Mike Mason nailed it in his analysis: coherence through orchestration, not autonomy, is what actually works.
What I Actually Built
Instead of one agent doing everything, I built an orchestration layer that plans the work, dispatches it in structured chunks, validates between each step, and feeds the results forward. Here's the stack:
┌──────────────────────────────────┐
│        Cowork (Planning)         │
│  Audits → Sprint plan → Tickets  │
└────────────────┬─────────────────┘
                 │ structured tickets
                 ▼
┌──────────────────────────────────┐
│     Claude Code (Dispatcher)     │
│  Reads tickets → Launches agent  │
│  Validates output → Sends next   │
│  ticket → Updates knowledge base │
└──────────┬───────────┬───────────┘
           │           │ notifications
           │           ▼
           │    ┌─────────────┐
           │    │    Slack    │
           │    │   (Status   │
           │    │   updates)  │
           │    └─────────────┘
           │
           │ launchAgent / addFollowup
           ▼
┌──────────────────────────────────┐
│ Cursor Background Agent (Cloud)  │
│    Writes code → Runs tests →    │
│        Commits → Pushes          │
│   Single session, full context   │
└──────────────────────────────────┘
Cowork handles the planning phase. Project audits, testing strategy, sprint breakdown, and generating structured ticket JSON. Each ticket specifies exactly which files to create, which to read for reference, and which to modify. No ambiguity.
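To make "no ambiguity" concrete, here's a sketch of what one of those tickets might look like. The field names and sample values are my own illustration, not Cowork's actual schema:

```typescript
// Illustrative ticket shape. Field names are my own, not Cowork's real schema.
interface Ticket {
  id: string;
  title: string;
  filesToCreate: string[]; // files the agent must create
  filesToRead: string[];   // reference files for context (never modified)
  filesToModify: string[]; // existing files the agent may touch
  acceptance: string[];    // checks the dispatcher runs after the ticket
}

const ticket: Ticket = {
  id: "BAIO-T3",
  title: "Add API tests for the reviews endpoints",
  filesToCreate: ["tests/api/reviews.test.ts"],
  filesToRead: ["src/routes/reviews.ts", "tests/helpers/mockDb.ts"],
  filesToModify: [],
  acceptance: ["npm test passes", "mocks follow the tests/helpers pattern"],
};
```

The three file lists are the scope contract: anything the agent touches outside them is grounds for a correction.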
Claude Code is the dispatcher. It reads the ticket queue, launches the Cursor background agent, polls for completion, validates the output, and dispatches the next ticket. It also sends Slack DMs after each ticket so I can glance at my phone and know the status without context-switching.
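The Slack side of this is the simplest piece. Slack's incoming webhooks accept a JSON body with a text field; the message format below is my own, not what my dispatcher literally sends:

```typescript
// Build a Slack status payload for one completed ticket.
// Slack incoming webhooks accept a JSON body with a "text" field;
// the exact message wording here is illustrative.
type TicketResult = { id: string; passed: boolean; testsRun: number };

function slackPayload(result: TicketResult): { text: string } {
  const status = result.passed ? "✅ passed" : "❌ FAILED";
  return {
    text: `Ticket ${result.id}: ${status} (${result.testsRun} tests run)`,
  };
}

// Sending it is a single POST (commented out so the sketch stays self-contained):
// await fetch(process.env.SLACK_WEBHOOK_URL!, {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(slackPayload(result)),
// });
```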
Cursor Background Agents do the actual coding. They clone the repo into an isolated cloud VM, read the project context, write the code, run the tests, commit, and push.
The key architectural decision: one agent session, multiple tickets via followups. Instead of launching a new agent for each ticket (which loses all context) or running agents in parallel (merge conflict chaos), I use Cursor's addFollowup API. One launchAgent call kicks things off, then each subsequent ticket is sent as a followup to the same session. When Ticket 5 runs, the agent already knows about the mock helpers it built in Ticket 2. No re-bootstrapping, no context re-loading.
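The launch-then-followup pattern reduces to a small loop. The launchAgent and addFollowup signatures below are stand-ins for Cursor's API, which I'm not reproducing here; the stub exists only to make the sketch self-contained:

```typescript
// One session, many tickets: launch once, then send each subsequent ticket
// as a followup to the same session so context carries forward.
// These signatures are stand-ins, not Cursor's actual API surface.
type Agent = {
  launchAgent: (prompt: string) => Promise<string>; // returns a session id
  addFollowup: (sessionId: string, prompt: string) => Promise<void>;
};

async function runChain(agent: Agent, tickets: string[]): Promise<string> {
  // The first ticket bootstraps the session with full project context.
  const sessionId = await agent.launchAgent(tickets[0]);
  // Every later ticket reuses that context: no re-bootstrapping.
  for (const ticket of tickets.slice(1)) {
    await agent.addFollowup(sessionId, ticket);
  }
  return sessionId;
}

// Stub for illustration: records the order of calls.
const calls: string[] = [];
const stub: Agent = {
  launchAgent: async (p) => { calls.push(`launch:${p}`); return "s1"; },
  addFollowup: async (_s, p) => { calls.push(`followup:${p}`); },
};
```

The ordering matters: the loop awaits each followup before sending the next, which is what makes Ticket 5 aware of the mock helpers from Ticket 2.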
The Dispatcher's Secret Sauce
This is where the architecture earns its keep. Between each ticket, the dispatcher runs a two-phase validation:
Phase 1: Command check. Did npm test pass? If the agent says it's done but tests are failing, that's a hard stop.
Phase 2: Convention review. The dispatcher pulls the project's CONVENTIONS.md — a living document of patterns, file structures, and rules specific to this codebase — and checks whether the agent's output actually follows them. Did it use the right import paths? Did it scope its changes to the files specified in the ticket? Did it follow the mock structure established in earlier tickets?
Minor issues get logged as warnings. Major issues trigger a correction followup. The dispatcher tells the agent exactly what to fix and why. Unrecoverable failures trigger a session recovery: new session, full re-bootstrap from the knowledge base.
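The decision logic of those two phases can be sketched as a pure function. The verdict shape and rule granularity are my simplification; the real dispatcher does more than a file-list diff:

```typescript
// Two-phase validation between tickets, simplified to a pure function.
// Verdict shape and rules are my own; the real convention review reads
// CONVENTIONS.md and is richer than a scope diff.
type Verdict = { ok: boolean; warnings: string[]; correction?: string };

function validate(
  testsPassed: boolean,         // phase 1: did the test command pass?
  changedFiles: string[],       // files the agent actually touched
  allowedFiles: string[],       // files the ticket authorized
  conventionWarnings: string[], // phase 2: minor CONVENTIONS.md findings
): Verdict {
  // Phase 1: hard stop if the test command failed.
  if (!testsPassed) {
    return { ok: false, warnings: [], correction: "Tests failing: fix before proceeding." };
  }
  // Phase 2: out-of-scope edits are a major issue -> correction followup.
  const outOfScope = changedFiles.filter((f) => !allowedFiles.includes(f));
  if (outOfScope.length > 0) {
    return {
      ok: false,
      warnings: conventionWarnings,
      correction: `Out-of-scope edits: ${outOfScope.join(", ")}. Revert and stay within the ticket.`,
    };
  }
  // Minor convention issues are logged as warnings, not blockers.
  return { ok: true, warnings: conventionWarnings };
}
```

Keeping this as a pure function over the agent's observable output is deliberate: the dispatcher never trusts the agent's self-report, only what it can re-check.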
This is the guardrail layer that Fortune's article says is missing from most agent setups. And they're right. Without it, you're just rolling dice.
The Knowledge Loop
Here's what makes this a system and not just a script. After every chain run, the orchestration layer updates the project's knowledge base: execution status, a detailed chain report with per-ticket results, a dated changelog, the project context with sprint history, and the conventions file with any new patterns discovered during execution.
Each sprint's output becomes the next sprint's input context. The agent gets smarter about the project with every run. Patterns that had to be specified explicitly in Sprint 1 become conventions the agent follows automatically in Sprint 2.
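Mechanically, the loop is just appending structured results to files the next run reads. CONVENTIONS.md is real; the other file names and the report shape here are my stand-ins:

```typescript
// Fold one chain run's results back into the knowledge base so the next
// sprint starts from richer context. CONVENTIONS.md is the real file name;
// CHANGELOG.md and the report shape are stand-ins for illustration.
import * as fs from "node:fs";
import * as path from "node:path";

function updateKnowledgeBase(
  dir: string,
  report: { sprint: number; results: string[]; newPatterns: string[] },
): void {
  fs.mkdirSync(dir, { recursive: true });
  // Dated changelog entry for this run.
  const entry =
    `## Sprint ${report.sprint} (${new Date().toISOString().slice(0, 10)})\n` +
    report.results.map((r) => `- ${r}`).join("\n") + "\n\n";
  fs.appendFileSync(path.join(dir, "CHANGELOG.md"), entry);
  // Patterns discovered during execution become conventions for next time.
  if (report.newPatterns.length > 0) {
    const rules = report.newPatterns.map((p) => `- ${p}`).join("\n") + "\n";
    fs.appendFileSync(path.join(dir, "CONVENTIONS.md"), rules);
  }
}
```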
The Numbers
No hand-waving. Here's what the system produced:
- 101 tests across 9 test files
- 40+ API endpoints covered (venues, reviews, auth, users, pricing, hubs, contracts)
- 8 tickets in the chain (1 manual setup, 7 automated)
- 0 failures in the full chain run
- ~1 hour of agent execution time for the automated tickets
- ~4 hours total human time (planning, setup, monitoring)
To be clear: this didn't happen "overnight" in the literal sense. The agent execution took about an hour. But that hour required zero manual intervention. The human time was front-loaded: planning the sprint, structuring the tickets, setting up the conventions file. Once the chain started running, I went and worked on something else.
The point isn't that it ran while I slept. The point is that the execution was fully automated, and the human effort shifted entirely to planning and review. For 8 tickets, an hour of execution is modest. The architecture scales. More tickets just means a longer run, not more human time.
The Skill Shift Nobody's Talking About
Here's the honest takeaway: I didn't write a single test. Not one. What I did was audit the codebase, design a testing strategy, break it into scoped tickets, define conventions, and build an orchestration system that could execute reliably.
That's a fundamentally different skill set. The industry is framing this as "AI replaces developers," but that's wrong. What it replaces is the mechanical act of translating a well-defined plan into code. What it demands is the ability to plan well, define scope precisely, write conventions that an agent can follow, and review output critically.
I'm not writing code anymore. I'm orchestrating systems that write code. And the developers who figure that out first are going to move absurdly fast.
Where This Is Heading
Right now, the coding agents run in Cursor's cloud. But the orchestration layer — the tickets, the dispatcher, the validation protocol, the knowledge loop — is infrastructure-agnostic. It runs wherever Claude Code runs.
The next step is a dedicated Mac Mini running OpenClaw — 5–7 watts idle, a couple bucks a month in electricity, silent operation. The same system, self-hosted, running on its own schedule. That's the endgame: a coding agent that lives on hardware you own, executing against a ticket queue you define, reporting results to your Slack.
But I'm not there yet. What I am doing is stress-testing this on a bigger project — roughly 3x the ticket volume. Same architecture, same orchestration pattern, significantly more scope. If the system holds up at scale the way it held up on BAIO, that's a different conversation entirely.
I'll be back with the results.