AI Agents · Orchestration · Testing

Orchestration, Not Autonomy: How I Took BAIO From Zero Tests to Full API Coverage

Every AI company sells the same dream. I tested a different hypothesis: what if the problem isn't the agents — it's the architecture around them? Here's what happened.

Gabriel Higareda · 14 min read · Development

Every AI company is selling you the same dream: describe what you want, go to sleep, wake up to working code. Cursor runs agents in cloud VMs. Claude Code ships with background execution. The pitch writes itself.

Fortune ran a piece last month calling the reality "far messier" than the promise. OpenClaw famously deleted a user's entire inbox because it ignored a pause instruction. And if you've spent any time with these tools on a real codebase — not a todo app, a real codebase — you already know the gap between demo and production is wide enough to drive a truck through.

So when I decided to take one of our own projects from zero test coverage to full API testing using AI agents, I wasn't chasing the dream. I was testing a hypothesis: what if the problem isn't the agents, but the architecture around them?

Here's what happened.

The Project

Bowling Alleys IO (BAIO). One of our own projects. A Next.js + Express.js + Firebase monorepo with 40+ API endpoints and exactly zero tests. The kind of codebase where one bad merge breaks three features and nobody knows until a user reports it. I needed comprehensive test coverage, and I needed it fast.

The traditional path: block out 2–3 days, grind through endpoint after endpoint. The path I took: plan the work in structured tickets, hand them to an AI agent chain, and let it execute while I worked on other things.

Why Single-Agent Autonomy Doesn't Work

Here's where most people go wrong. They fire up a coding agent, paste in a vague prompt like "write tests for the API," and expect magic. What they get is an agent that writes tests against imaginary endpoints, imports from wrong paths, builds mock structures that don't match the actual database schema, and confidently tells you everything passed.

The problem compounds. A bad mock helper in file one means every subsequent test file inherits the same broken assumptions. By the time the agent finishes, you've got 200 lines of tests that look great and test nothing real.

Single-agent autonomy fails on real codebases because real codebases have context — conventions, file structures, dependency patterns, database schemas — that can't be captured in a single prompt. Mike Mason nailed it in his analysis: coherence through orchestration, not autonomy, is what actually works.

What I Actually Built

Instead of one agent doing everything, I built an orchestration layer that plans the work, dispatches it in structured chunks, validates between each step, and feeds the results forward. Here's the stack:

┌──────────────────────────────────┐
│         Cowork (Planning)        │
│  Audits → Sprint plan → Tickets  │
└───────────────┬──────────────────┘
                │ structured tickets
                ▼
┌──────────────────────────────────┐
│    Claude Code (Dispatcher)      │
│  Reads tickets → Launches agent  │
│  Validates output → Sends next   │
│  ticket → Updates knowledge base │
└──────────┬───────────┬───────────┘
           │           │ notifications
           │           ▼
           │    ┌─────────────┐
           │    │    Slack    │
           │    │  (Status    │
           │    │   updates)  │
           │    └─────────────┘
           │
           │ launchAgent / addFollowup
           ▼
┌──────────────────────────────────┐
│  Cursor Background Agent (Cloud) │
│  Writes code → Runs tests →      │
│  Commits → Pushes                │
│  Single session, full context    │
└──────────────────────────────────┘

Cowork handles the planning phase: project audits, testing strategy, sprint breakdown, and generating structured ticket JSON. Each ticket specifies exactly which files to create, which to read for reference, and which to modify. No ambiguity.
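The article doesn't publish the exact ticket schema, so the field names below are illustrative, but a ticket along these lines captures the "no ambiguity" contract: every file the agent may touch is spelled out up front.

```typescript
// Hypothetical ticket shape -- field names are assumptions, not the
// real Cowork schema. The point is the explicit file scoping.
interface Ticket {
  id: string;
  title: string;
  filesToCreate: string[]; // files the agent must create
  filesToRead: string[];   // reference files for context and conventions
  filesToModify: string[]; // existing files the agent may change
  acceptance: string;      // how the dispatcher knows the ticket is done
}

const ticket: Ticket = {
  id: "BAIO-T2",
  title: "Add tests for the alleys API endpoints",
  filesToCreate: ["tests/api/alleys.test.ts"],
  filesToRead: ["tests/helpers/mockDb.ts", "CONVENTIONS.md"],
  filesToModify: [],
  acceptance: "npm test passes with no skipped suites",
};
```

Anything outside `filesToCreate` and `filesToModify` is out of scope, which is exactly what the dispatcher's convention review checks for later.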

Claude Code is the dispatcher. It reads the ticket queue, launches the Cursor background agent, polls for completion, validates the output, and dispatches the next ticket. It also sends Slack DMs after each ticket so I can glance at my phone and know the status without context-switching.
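The Slack piece is the simplest part of the stack. A minimal sketch, assuming a standard Slack incoming webhook rather than the DM API the article implies (both take the same one-line message):

```typescript
// Build the per-ticket status line. The format is an assumption; the
// article only says each ticket gets a glanceable Slack message.
function formatStatus(ticketId: string, passed: number, failed: number): string {
  const icon = failed === 0 ? "OK" : "FAIL";
  return `[${icon}] ${ticketId}: ${passed} passing, ${failed} failing`;
}

// Post to a Slack incoming webhook (Node 18+ global fetch).
async function notifySlack(webhookUrl: string, message: string): Promise<void> {
  const res = await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: message }),
  });
  if (!res.ok) throw new Error(`Slack webhook failed: ${res.status}`);
}

// Usage after each ticket completes:
// await notifySlack(process.env.SLACK_WEBHOOK_URL!,
//   formatStatus("BAIO-T2", 14, 0));
```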

Cursor Background Agents do the actual coding. They clone the repo into an isolated cloud VM, read the project context, write the code, run the tests, commit, and push.

The key architectural decision: one agent session, multiple tickets via followups. Instead of launching a new agent for each ticket (which loses all context) or running agents in parallel (merge conflict chaos), I use Cursor's addFollowup API. One launchAgent call kicks things off, then each subsequent ticket is sent as a followup to the same session. When Ticket 5 runs, the agent already knows about the mock helpers it built in Ticket 2. No re-bootstrapping, no context re-loading.
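The one-session pattern reduces to a small loop. `launchAgent` and `addFollowup` mirror the Cursor API names mentioned above, but the signatures here are assumptions sketched for clarity, not the real client:

```typescript
// Hypothetical client interface -- the real Cursor API surface differs.
type AgentClient = {
  launchAgent(prompt: string): Promise<string>;  // returns a session id
  addFollowup(sessionId: string, prompt: string): Promise<void>;
  waitForIdle(sessionId: string): Promise<void>; // poll until the agent stops
};

async function runChain(client: AgentClient, tickets: string[]): Promise<void> {
  // Ticket 1 bootstraps the session with full project context.
  const sessionId = await client.launchAgent(tickets[0]);
  await client.waitForIdle(sessionId);

  // Every later ticket is a followup to the SAME session, so the agent
  // keeps everything it learned: mock helpers, conventions, file layout.
  for (const ticket of tickets.slice(1)) {
    await client.addFollowup(sessionId, ticket);
    await client.waitForIdle(sessionId);
  }
}
```

One `launchAgent`, N-1 followups: that is the whole trick, and it is what lets Ticket 5 reuse the helpers from Ticket 2 for free.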

The Dispatcher's Secret Sauce

This is where the architecture earns its keep. Between each ticket, the dispatcher runs a two-phase validation:

Phase 1: Command check. Did npm test pass? If the agent says it's done but tests are failing, that's a hard stop.

Phase 2: Convention review. The dispatcher pulls the project's CONVENTIONS.md — a living document of patterns, file structures, and rules specific to this codebase — and checks whether the agent's output actually follows them. Did it use the right import paths? Did it scope its changes to the files specified in the ticket? Did it follow the mock structure established in earlier tickets?

Minor issues get logged as warnings. Major issues trigger a correction followup. The dispatcher tells the agent exactly what to fix and why. Unrecoverable failures trigger a session recovery: new session, full re-bootstrap from the knowledge base.
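The two phases can be sketched as follows. Phase 1 is mechanical; Phase 2 in the real system feeds CONVENTIONS.md and the diff back to a model, so this stub only shows its cheapest mechanical slice, the ticket-scope check (the verdict names are mine):

```typescript
import { execSync } from "node:child_process";

type Verdict = "pass" | "warn" | "fail";

// Phase 1: command check. Did npm test actually pass, regardless of
// what the agent claims? A non-zero exit code is a hard stop.
function commandCheck(cwd: string): boolean {
  try {
    execSync("npm test", { cwd, stdio: "pipe" });
    return true;
  } catch {
    return false;
  }
}

// Phase 2 (partial): scope review. Files changed outside the ticket's
// declared file list are convention violations.
function outOfScope(changedFiles: string[], allowedFiles: string[]): string[] {
  return changedFiles.filter((f) => !allowedFiles.includes(f));
}

function validate(cwd: string, changed: string[], allowed: string[]): Verdict {
  if (!commandCheck(cwd)) return "fail";              // correction followup
  return outOfScope(changed, allowed).length > 0
    ? "warn"                                          // logged, chain continues
    : "pass";                                         // dispatch next ticket
}
```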

This is the guardrail layer that Fortune's article says is missing from most agent setups. And they're right. Without it, you're just rolling dice.

The Knowledge Loop

Here's what makes this a system and not just a script. After every chain run, the orchestration layer updates the project's knowledge base: execution status, a detailed chain report with per-ticket results, a dated changelog, the project context with sprint history, and the conventions file with any new patterns discovered during execution.

Each sprint's output becomes the next sprint's input context. The agent gets smarter about the project with every run. Patterns that had to be specified explicitly in Sprint 1 become conventions the agent follows automatically in Sprint 2.
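Mechanically, the knowledge loop is just disciplined appends. CONVENTIONS.md is named in the article; the changelog filename and entry format below are assumptions:

```typescript
import { appendFileSync } from "node:fs";

// Format for a dated changelog entry -- the shape is an assumption.
function changelogEntry(date: string, summary: string): string {
  return `\n## ${date}\n${summary}\n`;
}

// After a chain run: fold results back into the knowledge base so the
// next sprint's agent inherits them as context instead of instructions.
function recordRun(date: string, summary: string, newPatterns: string[]): void {
  appendFileSync("CHANGELOG.md", changelogEntry(date, summary));
  for (const pattern of newPatterns) {
    appendFileSync("CONVENTIONS.md", `\n- ${pattern}`);
  }
}
```

A pattern appended here in Sprint 1 is read back via the ticket's reference files in Sprint 2, which is what closes the loop.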

The Numbers

No hand-waving. Here's what the system produced: full API test coverage for 40+ endpoints, delivered across 8 tickets in about an hour of unattended execution.

To be clear: this didn't happen "overnight" in the literal sense. The agent execution took about an hour. But that hour required zero manual intervention. The human time was front-loaded: planning the sprint, structuring the tickets, setting up the conventions file. Once the chain started running, I went and worked on something else.

The point isn't that it ran while I slept. The point is that the execution was fully automated, and the human effort shifted entirely to planning and review. For 8 tickets, an hour of execution is modest. The architecture scales. More tickets just means a longer run, not more human time.

The Skill Shift Nobody's Talking About

Here's the honest takeaway: I didn't write a single test. Not one. What I did was audit the codebase, design a testing strategy, break it into scoped tickets, define conventions, and build an orchestration system that could execute reliably.

That's a fundamentally different skill set. The industry is framing this as "AI replaces developers," but that's wrong. What it replaces is the mechanical act of translating a well-defined plan into code. What it demands is the ability to plan well, define scope precisely, write conventions that an agent can follow, and review output critically.

I'm not writing code anymore. I'm orchestrating systems that write code. And the developers who figure that out first are going to move absurdly fast.

Where This Is Heading

Right now, the coding agents run in Cursor's cloud. But the orchestration layer — the tickets, the dispatcher, the validation protocol, the knowledge loop — is infrastructure-agnostic. It runs wherever Claude Code runs.

The next step is a dedicated Mac Mini running OpenClaw — 5–7 watts idle, a couple bucks a month in electricity, silent operation. The same system, self-hosted, running on its own schedule. That's the endgame: a coding agent that lives on hardware you own, executing against a ticket queue you define, reporting results to your Slack.

But I'm not there yet. What I am doing is stress-testing this on a bigger project — roughly 3x the ticket volume. Same architecture, same orchestration pattern, significantly more scope. If the system holds up at scale the way it held up on BAIO, that's a different conversation entirely.

I'll be back with the results.

Want to Build Systems That Scale?

I ship production-grade software and agent orchestration. Top 1% Expert-Vetted on Upwork. If you're building something that needs to move fast without breaking — hire me on Upwork and I'll tell you within 24 hours if I'm the right fit.

Hire me on Upwork