Harnesses for Agent-Ready Code — From Copilot to Colleague

Chapter 03

Source-anchored claims

1. Reliability comes less from model cleverness than from surrounding scaffolding
2. Harness quality is a major determinant of coding-agent quality
3. Specs are not paperwork; they are executable intent
4. The practical unit of AI coding is the codebase, not the snippet
5. Agent-ready codebases are designed, not discovered
6. The harness is evolving from a local loop into a staged software factory
7. Harness quality now includes capability packaging, not only repo hygiene
8. Coordination is the unsolved runtime primitive for multi-agent systems
9. Coding agents expose the gap between standards a team possesses and standards it can operationalize
10. Subagent specialization makes process explicit and encodes team judgment into roles
11. Once agents go parallel and autonomous, the human's verification capacity, not the agents' generation capacity, is the binding constraint
12. Agents fabricate having verified: they report success they never achieved, so the harness must supply real verification, not trust the agent's account of it
13. Parallel agents need per-agent runtime isolation (a sandbox, micro-VM, or worktree each), because containers are not a sufficient boundary for agent-generated code

When coding agents disappoint, teams usually blame the model.

The model missed a dependency. The model misunderstood the architecture. The model refactored the right function but violated a local convention no outsider could have guessed. The model produced code that technically passed, yet somehow still felt wrong. In the postmortem, intelligence becomes the default suspect.

Sometimes that diagnosis is fair. Models do fail because they are weak, distracted, or simply not yet capable enough for the task. But in production codebases, that explanation is often too flattering to the humans involved. Many agent failures are not evidence that the model is hopeless. They are evidence that the environment was never made legible enough for delegated work.

Ryan Lopopolo puts the inversion bluntly: “The important thing is not the code but the prompt and the guardrails that got you there.” The line sounds almost rude in a field obsessed with generated output. But it captures one of the deepest shifts in AI-native engineering. Once you ask a machine to do implementation work instead of merely suggesting snippets, the surroundings become part of the product. The harness around the model starts determining what kind of work is even possible.

This chapter’s thesis is simple: if you want AI to write production software, do not begin by asking what model to use. Begin by asking whether your repository, specs, validations, and workflow are structured well enough for a machine collaborator to operate without constant rescue.

A small software-factory vignette

Figure 03.1: The repo is the interface

Meridian started where many teams now start: with a good model inside an ordinary payments repo, the software factory the opening chapter promised, before it had earned the name. At first the results feel magical. The agent writes tests faster than the humans expect. It handles small UI changes cleanly. It can even land a respectable refactor if a senior engineer hovers nearby and corrects its misunderstandings in real time. So the team expands the scope. They ask it to wire together a new endpoint, touch a migration, update a frontend state machine, and preserve some vague house style that nobody has ever written down.

Quality immediately gets erratic. One patch uses a dependency the team would never approve. Another passes tests but ignores a performance convention learned the hard way six months earlier. A third is logically fine yet shaped in a way that makes review irritating and rollback risky. The team says the model is inconsistent. What they really mean is that the workplace is inconsistent.

So they change the workplace. They add explicit setup scripts instead of Slack archaeology. They tighten lint and type gates. They create agent-facing instructions. They check in examples of accepted patterns. They write slimmer task specs before handing work off. They stop relying on "everyone kind of knows how we do migrations here." The repo becomes less like a haunted archive of past decisions and more like a managed surface for machine labor.

Take one Tuesday from what the team would later call the slop era. A senior engineer hands the agent a migration: move three endpoints onto the new state machine, and keep the house style. The agent comes back in minutes with a clean-looking diff. It compiles. The tests pass. But it has pulled in a parsing dependency the team had quietly banned, and it batches writes inside a loop — exactly the pattern a payments incident taught them to avoid six months earlier. Neither rule was written anywhere the agent could read. The reviewer catches the dependency, misses the write pattern, and approves. The convention resurfaces later, in a review comment nobody will search again. None of it was a hallucination. The agent guessed because the workplace gave it nothing firmer to stand on. The next morning the team does the unglamorous work: a checked-in example of an accepted migration, a lint rule for the banned import, a setup script that ends the Slack archaeology.
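The lint rule for the banned import can be very small. The sketch below is one way it might look, assuming Python sources; the module name `fastparse` is hypothetical, since the vignette never names the real dependency or the team's tooling.

```python
import ast

# Hypothetical banned list; the vignette does not name the real dependency.
BANNED_MODULES = {"fastparse"}


def find_banned_imports(source: str, banned=BANNED_MODULES):
    """Return sorted (line, module) pairs for every import of a banned module."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are local code.
            names = [node.module]
        else:
            continue
        for name in names:
            # Compare the top-level package so "fastparse.core" is caught too.
            if name.split(".")[0] in banned:
                hits.append((node.lineno, name))
    return sorted(hits)
```

Wired into the same gate the agent runs before proposing a patch, a check like this turns "quietly banned" into a failure the agent sees and can fix without a reviewer's memory.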

That is the recurring case this book will call the . It matters because it shows where the leverage really sits. The breakthrough is not that the model suddenly became a genius, but that the team stopped treating its own tacit judgment as invisible infrastructure.

The repo is the real interface

Figure 03.2: Specs persist

Most coding tools still present themselves through chat. You type a request, maybe select a few files, and wait for the assistant to propose a patch. That interface is useful, but it misleads by suggesting the real problem lives in the prompt. What actually determines the work is the environment around it — and unlike a prompt, an environment can be audited and improved.

In practice, the prompt is only the visible tip of a much larger system. The real interface is the codebase plus everything around it: the setup instructions, the architecture, the naming conventions, the tests, the lint rules, the examples of good patches, the traces of previous reviews, the ADRs, the failure cases, and the rules about performance or security that no single file states explicitly. A human engineer entering a mature repository absorbs these constraints slowly. They ask teammates what matters. They notice which patterns recur. They learn what kinds of changes get approved quickly and which ones trigger suspicion. An agent does not get that apprenticeship unless the team builds it.

Lopopolo makes the obligation explicit: “Your job is to build systems, software and structures that enable your team to be successful. And to do that, we need to make them legible to those agents that are driving the implementation.” That sentence is more radical than it first appears. It means the team is no longer only maintaining software for other humans. It is also maintaining a working environment for machine contributors. A cheap test for legibility: clone the repo into a fresh container and time how long an agent takes to reach a green test run. If it cannot — because step four lives in someone's shell history — that is a hole where the agent will guess.
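A rough proxy for that legibility test can be scripted. The sketch below does not run an agent; it only replays a repo's documented setup and test commands in order and reports how long they take to go green and which step failed first. The `STEPS` list is a placeholder assumption, and a real repository would substitute its own commands and run them in a fresh checkout.

```python
import subprocess
import sys
import time

# Placeholder steps; a real repo would list its own documented commands.
STEPS = [
    [sys.executable, "--version"],
]


def time_to_green(commands, cwd=None, timeout_s=600):
    """Run each command in order; return (succeeded, seconds, failed_command)."""
    start = time.monotonic()
    for cmd in commands:
        try:
            result = subprocess.run(cmd, cwd=cwd, timeout=timeout_s)
        except (OSError, subprocess.TimeoutExpired):
            # A missing executable or a hung step counts as a failure.
            return False, time.monotonic() - start, cmd
        if result.returncode != 0:
            return False, time.monotonic() - start, cmd
    return True, time.monotonic() - start, None
```

The failed command is the useful output: it points at the step that currently lives in someone's shell history instead of the repo.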

That is why documentation, ADRs, examples, and historical breadcrumbs matter so much. Those are not decorative artifacts around the “real” software process; in an AI-native workflow, they are part of the execution environment itself. The rule underneath is unforgiving: a standard that lives only in scattered memory is one the agent cannot inherit, and will therefore break — in ways that look mysterious only because the team never externalized the standard the agent broke.

This is also why the practical unit of AI coding is no longer the snippet but the codebase. Naman Jain describes the shift cleanly: “My first project was actually working on generating single line... snippets and my last project was generating an entire codebase.” Once the unit of work becomes the repository, environment design stops being background hygiene and becomes first-order leverage.

The core mistake many teams make is to treat code generation as the primary problem and repo legibility as a secondary concern. In reality, the second often dominates the first. A capable model dropped into a murky repository is like a strong engineer dropped into an organization with no onboarding, inconsistent standards, and no access to prior decisions. You can still get lucky. You cannot count on it.

Good code contains hundreds of unstated decisions

Figure 03.3: The software factory

One reason this problem is easy to underestimate is that experienced engineers are bad at seeing their own tacit judgment. Good code does not differ from bad code only in correctness. It differs in tone, proportion, naming, performance discipline, dependency choices, rollback safety, test shape, compatibility assumptions, reviewability, and fit with the broader system.

Lopopolo gives this problem a memorable scale when he says that producing a single patch can require “500 little decisions” around underspecified non-functional requirements. The exact number is not the point. The point is that repositories are dense with decisions that matter greatly but are rarely captured in the task description. A human engineer often fills those gaps through craft and context. A coding agent fills them through inference under uncertainty. Lopopolo names the mechanism: the models “during their training have seen trillions of lines of code that make every possible choice of those non-functional requirements that you could ever imagine.” Leave a requirement unspecified and the model samples one of those conventions, and nothing makes the sample yours.

That is where slop comes from.

The sloppy patch is not always the sign of a stupid model. It is often the sign of a task whose invisible success criteria were never written down. The agent guessed because the environment forced it to. Then humans act surprised that the guesses look generic.

Once you see the problem this way, the prescription changes. The answer is not only “prompt better”; it is to reduce the amount of silent guesswork that the environment demands. Externalize architecture choices. Store examples of accepted patterns. Make non-functional constraints explicit. Give the system stable ways to discover how this team expects software to be built.

That is what harness engineering really means. Not a fancier wrapper around a model, but the systematic conversion of tacit engineering judgment into durable, machine-usable constraints. Lopopolo describes the raw materials plainly (“leaving breadcrumbs, documentation, ADRs, persona oriented documentation around what a good job looks like”), and encoding them once buys reuse: have one engineer write down what a good QA plan looks like, and “every agent trajectory is going to get a good QA plan.”

Specs are not paperwork; they are executable intent

This is where spec-driven development becomes more than a documentation preference. In a purely human workflow, specs often compete with direct conversation. A strong team can get away with more ambiguity because engineers resolve a surprising amount through meetings, hallway discussions, pull-request comments, and local intuition. In an AI-mediated workflow, that ambiguity becomes more expensive. Context windows expire. Tasks get retried. Work gets decomposed into subproblems. Different agents touch different layers of the same system. Without persistent artifacts, intent keeps dissolving back into transient conversation. The operational test is whether intent would survive a fresh agent picking up the task cold; if not, it belongs in a persistent artifact, not the prompt.

Al Harris offers one of the clearest framings in the corpus: “The spec then becomes the natural language representation of your system. It has constraints, it has concerns around functional requirements, non-functional requirements...” That framing matters because it upgrades the spec from document to control surface. Treating it that way has a concrete consequence: when the output is wrong, you fix the line in the spec that under-specified it and regenerate rather than hand-patch the output. Repeatedly editing code the next run overwrites is the signal that the spec, not the patch, is the thing to change.

Harris makes a second point that is just as important: spec-driven development is “a structured workflow that we push you through to reliably deliver high-quality software... requirements, design, and execution phases.”

A useful spec in an AI-native environment does several jobs at once. It records what problem is being solved. It states constraints the agent should not have to rediscover through trial and error. It makes room for non-functional requirements that are otherwise easy to drop. It persists across retries and handoffs. And it creates a shared object that humans and machines can both inspect.
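One way to picture a spec as a shared, persistent object is a small structured record that renders to plain text. The sketch below is illustrative only: the `TaskSpec` name, its fields, and the section headings are assumptions, not a format that Harris or any particular tool prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical task spec: the problem plus constraints an agent should not guess."""
    problem: str
    constraints: list[str] = field(default_factory=list)
    non_functional: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Name the sections a fresh agent would otherwise have to guess at."""
        missing = []
        if not self.problem.strip():
            missing.append("problem")
        if not self.constraints:
            missing.append("constraints")
        if not self.non_functional:
            missing.append("non_functional")
        return missing

    def render(self) -> str:
        """Render plain text that can be checked into the repo and reused on retries."""
        lines = ["Problem", self.problem, "", "Constraints"]
        lines += [f"- {item}" for item in self.constraints]
        lines += ["", "Non-functional requirements"]
        lines += [f"- {item}" for item in self.non_functional]
        return "\n".join(lines)
```

Because the rendered text is a file rather than a chat message, a retry or a handoff starts from the same intent, and a reviewer can diff the spec when the output comes back wrong.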

Seen this way, specs are a form of context compression. They take sprawling intent that would otherwise live in chat history, tribal knowledge, or remembered conversation and package it into a stable artifact the workflow can keep returning to. They also make evaluation easier, because a system with a concrete spec can be judged against a clearer notion of success than one that began with a vague prompt and hope.

This does not mean every ticket needs an elaborate design doc. Over-specification can just as easily collapse exploration into bureaucracy, and some work is best discovered through fast iteration. The test is not how important the work feels but how expensive a misunderstanding would be: once the task is large, parallelized, safety-sensitive, or costly to review, explicit intent stops being ceremony and becomes leverage.

The deeper point is that spec-driven development matters more, not less, in an era of powerful coding agents. The stronger the generator, the more valuable a stable representation of intent becomes.

Agent-ready codebases are designed, not discovered

At this point the argument can sound abstract, so it helps to come back to repository mechanics. Eno Reyes is especially useful here because he connects old-fashioned engineering hygiene to an AI-native operating model. He begins with a deliberately basic question: “Do you have some automated validation for the format of your code?... for professional software engineers [it's] like, yeah, of course we do.” Then comes the important turn: “But I think you can go a step further.”

That extra step is the real substance of agent-readiness. The question is not simply whether a team has linting or tests, but whether the codebase has enough automated validation and explicit structure that a coding agent can move through it with bounded risk. A repository becomes agent-ready when it exposes enough of its standards, setup, and quality gates that delegated work becomes legible. The step further is that the validation surface is itself something the agent can extend: you can “ask a coding agent, could you figure out where we're not being opinionated enough about our linters” and have it write the missing rule. Reyes sets a low bar on purpose — “a slop test is better than no test” — because once a rough check exists, the next agent follows it and the rules ratchet tighter.

A practical checklist usually includes at least the following:

a stable folder structure rather than a maze of historical accidents
explicit setup, build, and run commands that do not rely on oral tradition
strong type, lint, and test gates the agent can run repeatedly
architecture decisions stored in files instead of buried in memory
examples of accepted patterns for tests, APIs, migrations, and reviews
specs or task briefs stored close enough to the work that they survive handoff
narrower tools or scripts for common operations where free-form shell access is unnecessary

None of this is glamorous. That is exactly why it matters. The biggest gains in coding agents often come not from frontier prompting technique but from reducing avoidable ambiguity in the repository itself. Validation also caps how ambitious delegation can get: Reyes is explicit that you cannot fan out parallel agents until single-task execution succeeds nearly 100 percent of the time — if you cannot tell automatically whether one change is safe, running twenty only multiplies unverified output.
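The gating rule can be made concrete with a small check. The sketch below is an assumption-laden illustration: the 0.98 threshold, the minimum of 20 recorded runs, and the cap of 20 agents are placeholders chosen for the example, since Reyes only says the single-task rate should be close to 100 percent.

```python
def single_task_success_rate(outcomes):
    """Fraction of automatically verified single-task runs that passed."""
    if not outcomes:
        return 0.0
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def max_parallel_agents(outcomes, threshold=0.98, min_runs=20, cap=20):
    """Allow fan-out only when enough verified runs clear the threshold.

    threshold, min_runs, and cap are illustrative assumptions, not sourced values.
    """
    if len(outcomes) < min_runs:
        return 1  # Too little evidence: stay with one agent at a time.
    if single_task_success_rate(outcomes) < threshold:
        return 1  # Single-task execution is not yet reliable enough.
    return cap
```

The important input is `outcomes`: each entry should come from an automated gate, not from an agent's own report that the task succeeded.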

There is also an organizational benefit hiding inside this technical one. Better repositories do not only help machines. They help less experienced engineers, new hires, cross-functional contributors, and future maintainers. In that sense, “agent-ready” is not some alien new standard imposed by AI. It is a sharper test of whether the team actually encoded its own expectations in reusable form. The agent is exposing the difference between standards the team possesses and standards the team can operationalize.

The harness is a workflow, not just a wrapper

It is tempting to imagine the harness as a thin layer around a model: maybe a system prompt, a tool list, a sandbox, and a few guardrails. That is too narrow.

A real harness includes environment setup, repository policy, validation steps, task decomposition, review surfaces, memory of prior work, failure handling, and the sequence in which all those things are applied. In other words, it is a workflow. Lopopolo's working definition names what ties those pieces together: “a good [harness] is really operationalized around giving the model text at the right time.” The unifying job is timing and selection: which file, rule, or example reaches the model at the step it needs it.
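The timing-and-selection job can be sketched as a lookup from workflow stage to checked-in files. Everything named here is hypothetical: the stage names, the file paths, and the `context_for` helper are assumptions for illustration, not an API from any real agent tool.

```python
from pathlib import Path

# Hypothetical stages and paths; a real repo would define its own table.
CONTEXT_BY_STAGE = {
    "plan": ["docs/adr/index.md", "specs/task.md"],
    "implement": ["docs/patterns/migration_example.md", "specs/task.md"],
    "review": ["docs/review_checklist.md"],
}


def context_for(stage, repo_root, table=CONTEXT_BY_STAGE):
    """Return (relative_path, text) pairs for files that exist and apply to this stage."""
    root = Path(repo_root)
    selected = []
    for rel in table.get(stage, []):
        path = root / rel
        if path.is_file():
            selected.append((rel, path.read_text(encoding="utf-8")))
    return selected
```

Keeping the table in the repo means the choice of what the model reads at each step is itself reviewable, rather than buried in a prompt someone edited once.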

This is where the software-factory metaphor becomes useful. Eric Zakariasson talks directly about “building your own [software factory],” and the phrase lands because it redirects attention from one-off generation to staged production. A factory has specifications, stations, checks, feedback loops, and manager-visible status. It does not assume that every worker can safely improvise in every direction.

Zakariasson makes a subtler point too: the factory itself needs a spec. “To set the spec for the factory,” he says, you would likely have a folder in the codebase with markdown guidance, best practices, and rules. That is exactly the conceptual move this chapter needs. The harness has to be encoded somewhere. Once it is checked into the repo, process stops living only in human habit and becomes part of the codebase’s working surface.

Once you think in harnesses, you immediately care about evaluation and runtime semantics. A harness that cannot measure quality is incomplete. A harness that cannot preserve state across longer tasks is fragile.

Subagents and specialization belong to the harness

The same logic extends beyond a single agent. One of the most interesting developments in modern coding systems is the move toward specialized roles: research agents, review agents, refactor agents, debugging agents, and broader subagent frameworks that let a larger task be decomposed into parallel, semi-independent work. OpenAI’s Codex materials describe subagents as “the ability wherein you can spin off a master task into decomposable parallel and independent tasks.”

The key insight is not merely that parallelism can make things faster. It is that specialization makes process explicit.

When a team creates a dedicated review agent, a repo-auditing agent, or a migration-focused agent with narrow tools and instructions, it is encoding judgment about how work should be done. The role itself becomes part of the harness. This mirrors what strong human organizations already do. They do not treat every task as a blank slate executed by a generalist. They create roles, review structures, and bounded responsibilities so judgment can scale.
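A role with narrow tools and instructions can be written down as data. The sketch below is a minimal illustration under stated assumptions: `AgentRole`, the tool names, and the `authorize` helper are all hypothetical and do not correspond to the Codex subagent interface or any other real framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRole:
    """Hypothetical role record: who the agent is and what it may touch."""
    name: str
    instructions: str
    allowed_tools: frozenset

    def can_use(self, tool: str) -> bool:
        return tool in self.allowed_tools


# Illustrative roles; tool names are placeholders.
REVIEWER = AgentRole(
    name="reviewer",
    instructions="Check the diff against checked-in patterns; never edit files.",
    allowed_tools=frozenset({"read_file", "run_tests"}),
)
MIGRATOR = AgentRole(
    name="migrator",
    instructions="Follow the accepted migration example; keep changes reversible.",
    allowed_tools=frozenset({"read_file", "write_file", "run_tests"}),
)


def authorize(role: AgentRole, tool: str) -> str:
    """Return the tool name if the role may use it, otherwise raise PermissionError."""
    if not role.can_use(tool):
        raise PermissionError(f"{role.name} may not use {tool}")
    return tool
```

The design choice worth noticing is that the reviewer's inability to write files is enforced by the harness, not requested in a prompt.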

But subagents also intensify the need for good scaffolding. More workers without a stronger harness do not create a factory. They create chaos faster. Parallel output is only valuable if the pieces can be recomposed, inspected, and evaluated. That means subagents are not an argument against harness engineering but evidence that it is becoming more important.

The new advantage is environment design

The marketing of AI coding tools naturally focuses on generation. That is the visible magic. The agent edits a file. It writes a test. It proposes a patch. Those moments are real and often impressive.

But the durable advantage is increasingly elsewhere. It belongs to teams that make their repositories legible. Teams that externalize non-functional judgment instead of leaving it trapped in senior engineers’ heads. Teams that treat specs as reusable intent rather than ceremonial paperwork. Teams that invest in validations and repo affordances that help an agent check its own work. Teams that gradually turn loose process into a staged, inspectable software factory. Reyes puts rough numbers on the stakes: this investment is where “the real like 5x, 6x, 7x comes from,” and the catch is that “it's a choice that you as an organization have.” The model will not hand it to you.

This is why harness engineering deserves to be treated as a primary discipline instead of a tactical trick. The harness is not a helper around the codebase. It is becoming part of the codebase.

And that may turn out to be one of the most important shifts in software engineering culture. The winners in AI coding will not simply be the teams with access to the strongest models. They will be the teams that built workplaces those models can actually understand.

Once the environment can produce delegated work at all, the obvious next question is no longer how to generate more. It is how to know whether the generated work is actually good.
