From Copilot to Colleague

Chapter 03 · 14 min read

Harnesses, Specs, and Codebases Agents Can Actually Use

How prepared environments make coding agents useful without accepting slop.

When coding agents disappoint, teams usually blame the model.

The model missed a dependency. The model misunderstood the architecture. The model refactored the right function but violated a local convention no outsider could have guessed. The model produced code that technically passed, yet somehow still felt wrong. In the postmortem, intelligence becomes the default suspect.

Sometimes that diagnosis is fair. Models do fail because they are weak, distracted, or simply not yet capable enough for the task. But in production codebases, that explanation is often too flattering to the humans involved. Many agent failures are not evidence that the model is hopeless. They are evidence that the environment was never made legible enough for delegated work.

Ryan Lopopolo puts the inversion bluntly: “The important thing is not the code but the prompt and the guardrails that got you there.” The line sounds almost rude in a field obsessed with generated output. But it captures one of the deepest shifts in AI-native engineering. Once you ask a machine to do implementation work instead of merely suggesting snippets, the surroundings become part of the product. The harness around the model starts determining what kind of work is even possible.

If you want AI to write production software, do not begin by asking what model to use. Begin by asking whether your repository, specs, validations, and workflow are structured well enough for a machine collaborator to operate without constant rescue.

A small software-factory vignette

Figure 03.1: The repo is the interface

Consider a team that starts where many teams now start: with a good model inside an ordinary repo.

At first the results feel magical. The agent writes tests faster than the humans expect. It handles small UI changes cleanly. It can even land a respectable refactor if a senior engineer hovers nearby and corrects its misunderstandings in real time. So the team expands the scope. They ask it to wire together a new endpoint, touch a migration, update a frontend state machine, and preserve some vague house style that nobody has ever written down.

Quality immediately gets erratic.

One patch uses a dependency the team would never approve. Another passes tests but ignores a performance convention learned the hard way six months earlier. A third is logically fine yet shaped in a way that makes review irritating and rollback risky. The team says the model is inconsistent. What they really mean is that the workplace is inconsistent.

So they change the workplace.

They add explicit setup scripts instead of Slack archaeology. They tighten lint and type gates. They create agent-facing instructions. They check in examples of accepted patterns. They write slimmer task specs before handing work off. They stop relying on “everyone kind of knows how we do migrations here.” The repo becomes less like a haunted archive of past decisions and more like a managed surface for machine labor.

That is the recurring case I will call the software factory. It matters because it shows where the leverage sits. The breakthrough is not that the model suddenly became a genius. The breakthrough is that the team stopped treating its own tacit judgment as invisible infrastructure.

The repo is the real interface

Figure 03.2: Specs persist

Most coding tools still present themselves through chat. You type a request, maybe select a few files, and wait for the assistant to propose a patch. That interface is useful, but it can also be misleading. It suggests that the real problem lives in the prompt.

In practice, the prompt is only the visible tip of a much larger system.

The real interface is the codebase plus everything around it: the setup instructions, the architecture, the naming conventions, the tests, the lint rules, the examples of good patches, the traces of previous reviews, the ADRs, the failure cases, and the rules about performance or security that no single file states explicitly. A human engineer entering a mature repository absorbs these constraints slowly. They ask teammates what matters. They notice which patterns recur. They learn what kinds of changes get approved quickly and which ones trigger suspicion.

An agent does not get that apprenticeship unless the team builds it.

Lopopolo makes the obligation explicit: “Your job is to build systems, software and structures that enable your team to be successful. And to do that, we need to make them legible to those agents that are driving the implementation.” That sentence is more radical than it first appears. It means the team is no longer only maintaining software for other humans. It is also maintaining a working environment for machine contributors.

That is why documentation, ADRs, examples, and historical breadcrumbs matter so much. Those are not decorative artifacts around the “real” software process. In an AI-native workflow, they become part of the execution environment itself. If human expectations live only in scattered memory, the agent cannot inherit them. If the rules of good work are mostly tacit, the agent will violate them in ways that look mysterious only because the team never externalized its own standards.

This is also why the practical unit of AI coding is no longer the snippet. It is the codebase. Naman Jain describes the shift cleanly: “My first project was actually working on generating single line... snippets and my last project was generating an entire codebase.” Once the unit of work becomes the repository, environment design stops being background hygiene and becomes first-order leverage.

The core mistake many teams make is to treat code generation as the primary problem and repo legibility as a secondary concern. In reality, the second often dominates the first. So the diagnostic move when an agent disappoints is to invert the blame: before swapping models, ask what a new senior hire would have had to ask a teammate to do this task safely, then check whether that answer exists anywhere the agent can read it. A capable model dropped into a murky repository is like a strong engineer dropped into an organization with no onboarding, inconsistent standards, and no access to prior decisions. You can still get lucky. You cannot count on it.

Good code contains hundreds of unstated decisions

Figure 03.3: The software factory

One reason this problem is easy to underestimate is that experienced engineers are bad at seeing their own tacit judgment. Good code does not differ from bad code only in correctness. It differs in tone, proportion, naming, performance discipline, dependency choices, rollback safety, test shape, compatibility assumptions, reviewability, and fit with the broader system.

Lopopolo gives this problem a memorable scale when he says that producing a single patch can require “500 little decisions” around underspecified non-functional requirements. The exact number is not the point. The point is that repositories are dense with decisions that matter greatly but are rarely captured in the task description. A human engineer often fills those gaps through craft and context. A coding agent fills them through inference under uncertainty.

That is where slop comes from.

The sloppy patch is not always the sign of a stupid model. It is often the sign of a task whose invisible success criteria were never written down. The agent guessed because the environment forced it to. Then humans act surprised that the guesses look generic.

Once you see the problem this way, the prescription changes. The answer is not only “prompt better.” It is to reduce the amount of silent guesswork that the environment demands. Externalize architecture choices. Store examples of accepted patterns. Make non-functional constraints explicit. Give the system stable ways to discover how this team expects software to be built.

That is what harness engineering means. Not a fancier wrapper around a model, but the systematic conversion of tacit engineering judgment into durable, machine-usable constraints.

This is also where Chapter 2 should still be echoing in the reader’s mind. Cheap generation raised the value of judgment. Chapter 3 is where that judgment stops living only inside senior people and starts getting encoded into the environment.

Specs are not paperwork; they are executable intent

This is where spec-driven development becomes more than a documentation preference.

In a purely human workflow, specs often compete with direct conversation. A strong team can get away with more ambiguity because engineers resolve a surprising amount through meetings, hallway discussions, pull-request comments, and local intuition. In an AI-mediated workflow, that ambiguity becomes more expensive. The failure mode is relying on chat to carry intent: context windows expire, tasks get retried, work gets decomposed into subproblems, and different agents touch different layers of the same system. Any intent that lives only in transient conversation dissolves the moment one of those events occurs. The operational test is whether a piece of intent would survive a fresh agent picking up the task cold; if it would not, it belongs in a persistent artifact, not the prompt.

Al Harris offers one of the clearest framings in the corpus: “The spec then becomes the natural language representation of your system. It has constraints, it has concerns around functional requirements, non-functional requirements...” That framing matters because it upgrades the spec from document to control surface.

Harris makes a second point that is just as important: spec-driven development is “a structured workflow that we push you through to reliably deliver high-quality software... requirements, design, and execution phases.” In other words, the spec is not a memo attached to the work. It is part of the workflow that shapes the work.

A useful spec in an AI-native environment does several jobs at once. It records what problem is being solved. It states constraints the agent should not have to rediscover through trial and error. It makes room for non-functional requirements that are otherwise easy to drop. It persists across retries and handoffs. And it creates a shared object that humans and machines can both inspect.

Seen this way, specs are a form of context compression. They take sprawling intent that would otherwise live in chat history, tribal knowledge, or remembered conversation and package it into a stable artifact the workflow can keep returning to. They also make evaluation easier, because a system with a concrete spec can be judged against a clearer notion of success than one that began with a vague prompt and hope.
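One way to picture a spec as a persistent artifact is a small record that renders to a markdown file checked in next to the work. The sketch below is illustrative only: the `TaskSpec` class, its field names, and the example task are assumptions made for this example, not a format described by Harris or used by any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical spec record; the field names are illustrative assumptions."""

    problem: str
    constraints: list[str] = field(default_factory=list)
    non_functional: list[str] = field(default_factory=list)
    acceptance: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the spec as markdown so it can live in the repo, not in chat."""

        def section(title: str, items: list[str]) -> str:
            # An empty section is made visible instead of silently dropped.
            body = "\n".join(f"- {item}" for item in items) or "- (none recorded)"
            return f"## {title}\n{body}"

        parts = [
            f"# Spec\n\n{self.problem}",
            section("Constraints", self.constraints),
            section("Non-functional requirements", self.non_functional),
            section("Acceptance checks", self.acceptance),
        ]
        return "\n\n".join(parts) + "\n"


# Example task, invented for illustration.
spec = TaskSpec(
    problem="Add a read-only endpoint that lists a user's invoices.",
    constraints=["No new third-party dependencies"],
    non_functional=["Existing migrations must stay reversible"],
    acceptance=["Type, lint, and test gates pass"],
)
spec_markdown = spec.to_markdown()
```

Because the rendered text is an ordinary file, a retried task or a fresh agent can read the same constraints the first attempt saw, and a reviewer can judge the patch against the same acceptance checks.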

This does not mean every ticket needs an elaborate design doc. Over-specification can absolutely collapse exploration into bureaucracy, and small or exploratory work is best discovered through fast iteration — writing a heavy spec there is the wrong choice. The test for when to reach for one: the moment the cost of misunderstanding rises — because the task is large, parallelized across agents, safety-sensitive, or expensive to review — write the spec before generating. Below that line, prompt and iterate; above it, externalize intent first, because that is where a stable representation of intent becomes leverage rather than ceremony.
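The test above is simple enough to write down as a predicate. This is a minimal sketch under the assumption that a team tags each task with the four risk factors the paragraph names; the `Task` class and `needs_spec` function are hypothetical names, and a real team would likely weigh these factors rather than treat them as plain flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor; each flag mirrors one risk factor in the text."""

    large: bool = False
    parallelized: bool = False
    safety_sensitive: bool = False
    expensive_to_review: bool = False


def needs_spec(task: Task) -> bool:
    """Return True when any factor raises the cost of misunderstanding."""
    return any(
        (
            task.large,
            task.parallelized,
            task.safety_sensitive,
            task.expensive_to_review,
        )
    )


# A quick exploratory change: prompt and iterate.
exploratory = Task()
# Work split across several agents: write the spec first.
fan_out = Task(parallelized=True)
```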

The deeper point is that spec-driven development matters more, not less, in an era of powerful coding agents. The stronger the generator, the more valuable a stable representation of intent becomes.

Agent-ready codebases are designed, not discovered

Eno Reyes is especially useful here because he connects old-fashioned engineering hygiene to an AI-native operating model. He begins with a deliberately basic question: “Do you have some automated validation for the format of your code?... for professional software engineers [it's] like, yeah, of course we do.” Then comes the important turn: “But I think you can go a step further.”

That extra step is the real substance of agent-readiness. The question is not simply whether a team has linting or tests. It is whether the codebase has enough automated validation and explicit structure that a coding agent can move through it with bounded risk. A repository becomes agent-ready when it exposes enough of its standards, setup, and quality gates that delegated work becomes legible.

A practical checklist usually includes at least the following:

  • a stable folder structure rather than a maze of historical accidents
  • explicit setup, build, and run commands that do not rely on oral tradition
  • strong type, lint, and test gates the agent can run repeatedly
  • architecture decisions stored in files instead of buried in memory
  • examples of accepted patterns for tests, APIs, migrations, and reviews
  • specs or task briefs stored close enough to the work that they survive handoff
  • narrower tools or scripts for common operations where free-form shell access is unnecessary

None of this is glamorous. That is exactly why it matters. The biggest gains in coding agents often come not from frontier prompting technique but from reducing avoidable ambiguity in the repository itself.
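Parts of the checklist can be audited mechanically. The following is a rough sketch, not a standard: the item labels and candidate paths in `READINESS_SIGNALS` are assumptions chosen for illustration (only `AGENTS.md` echoes a convention mentioned in the source anchors), and the presence of a file says nothing about its quality.

```python
from pathlib import Path

# Hypothetical mapping from checklist items to paths that would count as evidence.
# Every path here is an assumption; adjust to the conventions of the repo at hand.
READINESS_SIGNALS: dict[str, list[str]] = {
    "setup/build/run commands": ["Makefile", "scripts/setup.sh"],
    "agent-facing instructions": ["AGENTS.md"],
    "architecture decisions": ["docs/adr"],
    "accepted-pattern examples": ["docs/examples"],
    "task specs": ["specs"],
}


def audit_repo(root: str) -> dict[str, bool]:
    """Report, per checklist item, whether any candidate path exists under root."""
    base = Path(root)
    return {
        item: any((base / candidate).exists() for candidate in candidates)
        for item, candidates in READINESS_SIGNALS.items()
    }


def missing_items(root: str) -> list[str]:
    """List the checklist items with no evidence on disk."""
    return [item for item, present in audit_repo(root).items() if not present]
```

Calling `missing_items(".")` from the repository root gives a first to-do list. Items such as "strong type, lint, and test gates" are deliberately left out because they need to be run, not just located.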

There is also an organizational benefit hiding inside this technical one. Better repositories do not only help machines. They help less experienced engineers, new hires, cross-functional contributors, and future maintainers. In that sense, “agent-ready” is not some alien new standard imposed by AI. It is a sharper test of whether the team actually encoded its own expectations in reusable form.

The agent is exposing the difference between standards the team possesses and standards the team can operationalize.

The harness is a workflow, not just a wrapper

It is tempting to imagine the harness as a thin layer around a model: maybe a system prompt, a tool list, a sandbox, and a few guardrails. That is too narrow.

A real harness includes environment setup, repository policy, validation steps, task decomposition, review surfaces, memory of prior work, failure handling, and the sequence in which all those things are applied. In other words, it is a workflow.

This is where the software-factory metaphor becomes useful. Eric Zakariasson talks directly about “building your own software factory,” and the phrase lands because it redirects attention from one-off generation to staged production. A factory has specifications, stations, checks, feedback loops, and manager-visible status. It does not assume that every worker can safely improvise in every direction.

Zakariasson makes a subtler point too: the factory itself needs a spec. “To set the spec for the factory,” he says, you would likely have a folder in the codebase with markdown guidance, best practices, and rules. That is exactly the conceptual move this chapter needs. The harness has to be encoded somewhere. Once it is checked into the repo, process stops living only in human habit and becomes part of the codebase’s working surface.
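A folder of markdown guidance only helps if something reads it. As a minimal sketch, the helper below gathers every markdown file under a guidance folder into one block of text that could be handed to an agent. The function name `load_guidance`, the comment-style file markers, and any folder name passed to it are assumptions for illustration, not a convention Zakariasson describes.

```python
from pathlib import Path


def load_guidance(folder: str) -> str:
    """Concatenate every markdown file under folder, in sorted path order."""
    base = Path(folder)
    parts: list[str] = []
    for path in sorted(base.rglob("*.md")):
        relative = path.relative_to(base).as_posix()
        text = path.read_text(encoding="utf-8").strip()
        # Label each chunk with its source file so a rule can be traced back.
        parts.append(f"<!-- {relative} -->\n{text}")
    return "\n\n".join(parts)
```

Sorting keeps the output stable across runs, so a change to the guidance shows up as a reviewable diff in the files themselves and not as a surprise in agent behavior.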

This is also where the book’s middle spine starts locking together. Once you think in harnesses, you immediately care about evaluation and runtime semantics. A harness that cannot measure quality is incomplete. A harness that cannot preserve state across longer tasks is fragile. Chapter 3 does not end the argument. It hands the book directly into Chapter 4.

Subagents and specialization belong to the harness

The same logic extends beyond a single agent.

One of the most interesting developments in modern coding systems is the move toward specialized roles: research agents, review agents, refactor agents, debugging agents, and broader subagent frameworks that let a larger task be decomposed into parallel, semi-independent work. OpenAI has demonstrated the pattern directly: in one workshop, Codex spun up parallel sub-agents to review a directory of files at once, each sub-agent assigned its own model, sandbox mode, and tool access, with an orchestrator partitioning the work and collating the results.
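The shape of that pattern, partition, fan out, collate, can be sketched without any agent at all. The code below is not Codex's API or any vendor's interface: `SubAgentConfig`, `review_chunk`, and `orchestrate` are hypothetical names, the model and sandbox values are placeholder strings, and the "review" is a stub that would be replaced by a real sub-agent call.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class SubAgentConfig:
    """Hypothetical per-role settings; values are placeholders, not real identifiers."""

    name: str
    model: str
    sandbox: str
    tools: tuple[str, ...]


def partition(items: list[str], n: int) -> list[list[str]]:
    """Split items round-robin into n chunks."""
    return [items[i::n] for i in range(n)]


def review_chunk(config: SubAgentConfig, files: list[str]) -> dict[str, str]:
    """Stand-in for a sub-agent call; returns one note per file."""
    return {path: f"reviewed by {config.name}" for path in files}


def orchestrate(files: list[str], configs: list[SubAgentConfig]) -> dict[str, str]:
    """Partition the files, run one worker per config, and collate the results.

    Assumes configs is non-empty.
    """
    chunks = partition(files, len(configs))
    with ThreadPoolExecutor(max_workers=len(configs)) as pool:
        partials = list(pool.map(review_chunk, configs, chunks))
    merged: dict[str, str] = {}
    for partial in partials:
        merged.update(partial)
    return merged


configs = [
    SubAgentConfig("reviewer-1", "placeholder-model", "read-only", ("read_file",)),
    SubAgentConfig("reviewer-2", "placeholder-model", "read-only", ("read_file",)),
]
report = orchestrate(["a.py", "b.py", "c.py"], configs)
```

Even in this toy form, the orchestrator is where the process decisions live: how work is split, which role gets which permissions, and how partial results are merged into something a human can inspect.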

The key insight is not merely that parallelism can make things faster. It is that specialization makes process explicit.

When a team creates a dedicated review agent, a repo-auditing agent, or a migration-focused agent with narrow tools and instructions, it is encoding judgment about how work should be done. The role itself becomes part of the harness. This mirrors what strong human organizations already do. They do not treat every task as a blank slate executed by a generalist. They create roles, review structures, and bounded responsibilities so judgment can scale.

But subagents also intensify the need for good scaffolding. More workers without a stronger harness do not create a factory. They create chaos faster. The mistake to watch for: adding parallel agents before you can recompose, inspect, and evaluate their output. At that point you have multiplied generation without multiplying the checks, and the throughput is illusory. The rule of thumb is to add a subagent role only once the recomposition and review surface for its output already exists. That means subagents are not an argument against harness engineering. They are evidence that harness engineering is becoming more important.
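The rule of thumb can be enforced in code instead of remembered. This is a small sketch, assuming a team keeps a registry of subagent roles: `RoleRegistry` and its methods are hypothetical, and the "review surface" is reduced to a label where a real system would point at a checklist, an evaluation suite, or a human queue.

```python
class RoleRegistry:
    """Hypothetical registry that refuses roles whose output nobody can review."""

    def __init__(self) -> None:
        self._review_surfaces: dict[str, str] = {}
        self._roles: dict[str, str] = {}

    def register_review_surface(self, output_kind: str, reviewer: str) -> None:
        """Record how a given kind of output gets recomposed and checked."""
        self._review_surfaces[output_kind] = reviewer

    def add_role(self, name: str, output_kind: str) -> None:
        """Add a subagent role only if its output already has a review surface."""
        if output_kind not in self._review_surfaces:
            raise ValueError(
                f"no review surface for {output_kind!r}; add one before adding {name!r}"
            )
        self._roles[name] = output_kind

    def roles(self) -> dict[str, str]:
        """Return a copy of the registered roles."""
        return dict(self._roles)
```

A call such as `add_role("migration-agent", "migration")` fails until `register_review_surface("migration", ...)` has been called, which turns the ordering advice into a hard gate.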

The new advantage is environment design

The marketing of AI coding tools naturally focuses on generation. That is the visible magic. The agent edits a file. It writes a test. It proposes a patch. Those moments are real and often impressive.

But the durable advantage is increasingly elsewhere.

It belongs to teams that make their repositories legible. Teams that externalize non-functional judgment instead of leaving it trapped in senior engineers’ heads. Teams that treat specs as reusable intent rather than ceremonial paperwork. Teams that invest in validations and repo affordances that help an agent check its own work. Teams that gradually turn loose process into a staged, inspectable software factory.

This is why harness engineering deserves to be treated as a primary discipline instead of a tactical trick. The harness is not a helper around the codebase. It is becoming part of the codebase.

The winners in AI coding will not simply be the teams with access to the strongest models. They will be the teams that built workplaces those models can actually understand.

That is the bridge into the next chapter. Once the environment can produce delegated work at all, the obvious next question is no longer how to generate more. It is how to know whether the generated work is actually good.

What to do with this

  • When an agent disappoints, invert the blame before swapping models: ask whether your repository, specs, validations, and workflow are structured well enough for a machine collaborator to operate without constant rescue. Treat the unexplained "sloppy patch" as a missing success criterion, not a stupid model.
  • Run the agent-readiness checklist on one repo this week: a stable folder structure, explicit setup/build/run commands, strong type/lint/test gates the agent can run repeatedly, architecture decisions stored in files, examples of accepted patterns for tests/APIs/migrations/reviews, and specs stored close enough to the work to survive handoff.
  • Stop relying on "everyone kind of knows how we do migrations here." Externalize the tacit standards: check in examples of accepted patterns, write down non-functional constraints, and replace Slack archaeology with setup scripts the agent can read.
  • Gate spec-writing by cost of misunderstanding, not by habit. For small or exploratory work, prompt and iterate; once a task is large, parallelized, safety-sensitive, or expensive to review, write the spec first — with constraints and non-functional requirements — so it persists across retries and handoffs.
  • Apply the survival test to intent: if a fresh agent picking up the task cold would lose a piece of intent because context expired or work was decomposed, move that intent out of chat and into a persistent artifact.
  • Encode the factory's own rules into the repo. Create a folder of markdown guidance, best practices, and rules so process stops living only in human habit, and add a specialized subagent role only once the surface to recompose and review its output already exists.

10 claims · 33 source anchors

Evidence — Source Anchors

Reliability comes less from model cleverness than from surrounding scaffolding

  • The important thing is not the code but the prompt and the guardrails that got you there.
    #16 · Ryan Lopopolo, OpenAI · confidence: high
  • Agents have intelligence and capabilities, but not always expertise that we need for real work.
    #83 · Barry Zhang & Mahesh Murag, Anthropic · confidence: high
  • these are three kind of like ingredients which are pretty simple and pretty basic, but I think provide an interesting kind of like first principles approach for how to think about
    #198 · Harrison Chase, LangChain/LangGraph · confidence: high

Harness quality is a major determinant of coding-agent quality

  • a good harness is really operationalized around giving the model text at the right time
    #16 · Ryan Lopopolo, OpenAI · confidence: high
  • there's so much work that has been put in uh over the last you know 20 to 30 years around the automated validation and verification of software that you build
    #57 · Eno Reyes, Factory AI · confidence: high
  • instead of micromanaging, what I'm doing is I'm scaffolding and providing context.
    #190 · Eric Hou, Augment Code · confidence: high
  • identifying problems with the code because if there's no problems then it's probably high quality code
    #179 · Josh Albrecht, Imbue · confidence: high

Specs are not paperwork; they are executable intent

  • specs are natural language, you're using specs as a control surface to explain what you want the system to do.
    #40 · Al Harris, Amazon Kiro · confidence: high
  • leaving breadcrumbs, documentation, ADRs, persona oriented documentation around what a good job looks like.
    #16 · Ryan Lopopolo, OpenAI · confidence: high

The practical unit of AI coding is the codebase, not the snippet

  • snippets and my last project was generating an entire codebase.
    #72 · Naman Jain, Cursor · confidence: high
  • agents MD files an open standard
    #57 · Eno Reyes, Factory AI · confidence: high
  • codebase for harness engineering
    #16 · Ryan Lopopolo, OpenAI · confidence: high

Agent-ready codebases are designed, not discovered

  • agents MD files an open standard
    #57 · Eno Reyes, Factory AI · confidence: high
  • context deficit as the biggest blocker.
    #190 · Eric Hou, Augment Code · confidence: high
  • a garbage codebase you're going to get
    #621 · Matt Pocock · confidence: high

The harness is evolving from a local loop into a staged software factory

  • getting to a place where you can build your own like software factory
    #629 · Eric Zakariasson, Cursor · confidence: high
  • unified agent harness that will manage
    #632 · Vaibhav Srivastav & Katia Gil Guzman, OpenAI · confidence: high
  • parallel agents working together to fix
    #42 · Robert Brennan, OpenHands · confidence: high
  • The difference with missions is that we run features serially.
    #653 · Luke Alvoeiro, Factory · confidence: high
  • Our longest mission ran for 16 days
    #653 · Luke Alvoeiro, Factory · confidence: high
  • We just kind of gave each role its own kind of context window.
    #691 · Ash Prabaker & Andrew Wilson, Anthropic · confidence: high
  • it's no longer about the model or the agent. It's about the process.
    #743 · Vincent Koc, OpenClaw · confidence: high

Harness quality now includes capability packaging, not only repo hygiene

  • That's what a skill is. You're teaching the the LLM how to do something in the way that you expect it to be done
    #654 · Nick Nisi & Zack Proser, WorkOS · confidence: high
  • This is how the agent is
    #683 · Pedro Rodrigues, Supabase · confidence: high
  • 49% reduction of the initial
    #625 · Sam Morrow, GitHub · confidence: high
  • the schema is the UI for the agent.
    #744 · Michael Hablich, Google (Chrome DevTools) · confidence: high

Coordination is the unsolved runtime primitive for multi-agent systems

  • the thing that's missing for me is coordination.
    #704 · Lou Bichard, Ona · confidence: high
  • through sort of state machines, you know, by building out workflows and effectively state machines
    #704 · Lou Bichard, Ona · confidence: high
  • They step on each other's changes. They duplicate work. They make inconsistent architectural decisions.
    #653 · Luke Alvoeiro, Factory · confidence: high
  • we have the two agents basically negotiate what done actually means.
    #691 · Ash Prabaker & Andrew Wilson, Anthropic · confidence: high

Coding agents expose the gap between standards a team possesses and standards it can operationalize

  • there's so much work that has been put in uh over the last you know 20 to 30 years around the automated validation and verification of software that you build
    #57 · Eno Reyes, Factory AI · confidence: high
  • specs are natural language, you're using specs as a control surface to explain what you want the system to do.
    #40 · Al Harris, Amazon Kiro · confidence: high

Subagent specialization makes process explicit and encodes team judgment into roles

  • a good harness is really operationalized around giving the model text at the right time
    #16 · Ryan Lopopolo, OpenAI · confidence: high