From Copilot to Colleague

Chapter 04 · 11 min read

Evals Are the Control System

Why production trust comes from measurement loops, not vibes.

The obvious failure mode of AI is that it can be wrong. The more dangerous one is that it can look right often enough that the team stops measuring.

A demo works twice. A prototype feels sharp. A coding agent lands a decent patch. A support assistant answers a handful of questions convincingly, and everyone starts speaking in the language of vibes — the system feels promising, maybe even close to ready. That is exactly where the trouble begins, because a promising feeling is not a control loop. Production trust comes from the ability to compare versions, catch regressions, preserve hard-won lessons, and measure whether the system still works when real users, real data, and real edge cases arrive. That is what evals are for, and this chapter argues they are not a side practice you bolt on before launch. They are the operating system of a system you intend to trust.

This is the same argument Chapter 2 made about review, now mechanized. There, judgment under abundance was a human posture; here it becomes an instrument. Once a system does delegated work, you cannot eyeball every output, and the discipline that kept cheap generation honest has to become something you can run.

The unit of evaluation changed

Figure 04.1 / Evals are not unit tests

For most of the short history of AI evaluation, the unit was small: a single completion, a one-line answer, a snippet judged in isolation. That worked while the systems themselves were small. It stops working the moment the system's job grows.

Naman Jain, who builds at Cursor, names the shift in his own work: "coding capabilities have leapt from generating one-line snippets to completing entire codebases with agentic workflows." When the deliverable was a snippet, you could grade the snippet. When the deliverable is a multi-file change across a real repository, grading the diff line by line tells you almost nothing about whether the system did the job. The unit of evaluation has to grow to match the unit of work. A codebase change, a multi-step workflow, a retrieval-heavy research task — each has to be judged at the level it operates, not at the level that happens to be easy to score.

This is why the chapter's title insists on control system rather than test suite. A test suite checks fixed assertions about small units. A control system steers a large, drifting process toward a goal over time. The distinction is not pedantic: the choice of unit determines what you build, what you measure, and whether your measurement survives contact with a system that does real work.

Evals are not unit tests — and also are

Figure 04.2 / The unit of evaluation

Ido Pesok, working on evals at Vercel's v0, gave a talk with a deliberately blunt title: "Evals are not unit tests." His point is a reasoning posture. A unit test encodes a binary fact — the function returns 4, or it is broken. An eval rarely encodes a fact like that. It encodes a judgment about quality, helpfulness, safety, or fit, and treating a pass/fail eval score as if it were a unit test's green checkmark quietly smuggles in a certainty the measurement does not have. Application-layer evals are messy because reality is messy: users, latency, cost, policy, and workflow constraints all bear on whether an output was actually good, and none of them reduce cleanly to true or false.

It is worth holding that against a practitioner who says the opposite. Lawrence Jones at incident.io calls them, flatly, "AI unit tests" — and stores them as YAML files next to the prompts they grade. The disagreement looks sharp and is actually productive, because the two are describing different surfaces of the same artifact. Pesok is describing the human reasoning stance: do not mistake a quality score for a correctness proof. Jones is describing the agent-facing interface: what the eval looks like to a coding agent that just needs to add or modify a case. Both are right. An eval is not a unit test in its epistemology and is very much like one in its ergonomics, and a chapter that flattens that tension loses something true about how the discipline actually works.
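The two stances can sit side by side in a few lines. The sketch below is illustrative only: `EvalCase`, `rubric_score`, `suite_report`, and the keyword rubric are hypothetical names and a deliberately crude grader, not Pesok's or Jones's tooling. It contrasts an assertion that is simply true or false with a graded score that becomes pass/fail only once a team chooses a threshold.

```python
from dataclasses import dataclass
from statistics import mean


def add(a: int, b: int) -> int:
    return a + b


# A unit test encodes a binary fact: the function returns 4, or it is broken.
assert add(2, 2) == 4


@dataclass
class EvalCase:
    """One graded example. Field names are illustrative assumptions."""
    prompt: str
    must_mention: list[str]  # hypothetical rubric: points a good answer covers


def rubric_score(answer: str, case: EvalCase) -> float:
    """Return the fraction of rubric points the answer covers (0.0 to 1.0)."""
    if not case.must_mention:
        return 1.0
    text = answer.lower()
    hits = sum(1 for point in case.must_mention if point.lower() in text)
    return hits / len(case.must_mention)


def suite_report(scores: list[float], threshold: float) -> dict:
    """Summarize graded scores. The threshold is a team policy, not a proof."""
    avg = mean(scores)
    return {"mean": avg, "min": min(scores), "passed": avg >= threshold}
```

With a three-point rubric, an answer that covers one point scores about 0.33. Whether that is acceptable is a policy decision carried by `threshold`, which is exactly the uncertainty a green checkmark would hide.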

Real tasks beat synthetic cleverness

Figure 04.3 / The observability flywheel

If evals encode judgment rather than facts, the question becomes where the judgment comes from — and the strongest answer in the corpus is that it is mined, not invented.

The best evaluation sets are rarely written from scratch in a clean room. They are drawn from operational history: the failed support conversations, the difficult research tasks, the painful coding regressions, the edge cases that triggered an escalation. What hurt you in production is far more informative than what looked clever in a benchmark, because it is real, specific, and already known to matter. Jain's team gives a concrete recipe to copy: take a real codebase, crawl its commit history, find the commits that fixed actual problems, and turn each fix into a graded task the agent has to reproduce — revert the fix, hand the agent the broken state, and score whether it gets back to the known-good commit. The escalation logs, incident tickets, and bug-fix commits you already have are an unmined eval set; the work of authoring it has mostly been done for you by the failures themselves. The eval is not a synthetic puzzle. It is a re-run of work that happened.
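The commit-mining recipe can be sketched as a filter over commit metadata. This is a minimal sketch under stated assumptions, not the pipeline Jain's team runs: `Commit`, `EvalTask`, `mine_fix_tasks`, and the keyword pattern for spotting fixes are all hypothetical. The parent commit stands in for "the fix reverted", which is an approximation because it also drops any later commits.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    sha: str
    parent_sha: str
    message: str


@dataclass(frozen=True)
class EvalTask:
    task_id: str
    broken_sha: str      # state handed to the agent (before the fix landed)
    known_good_sha: str  # state the agent's result is scored against
    description: str


# Assumed heuristic: a commit whose message mentions a fix is a bug fix.
# Real histories usually need a stricter signal, such as linked issue IDs.
FIX_PATTERN = re.compile(r"\b(fix|fixes|fixed|bug|regression|hotfix)\b", re.IGNORECASE)


def mine_fix_tasks(history: list[Commit]) -> list[EvalTask]:
    """Turn each fix-looking commit into a task the agent must reproduce."""
    tasks = []
    for commit in history:
        if not FIX_PATTERN.search(commit.message):
            continue
        tasks.append(
            EvalTask(
                task_id=f"fix-{commit.sha[:7]}",
                broken_sha=commit.parent_sha,
                known_good_sha=commit.sha,
                description=commit.message.splitlines()[0],
            )
        )
    return tasks
```

Scoring is left out on purpose. In practice it would mean checking out `broken_sha`, letting the agent work, and comparing the result against `known_good_sha`, for example by running the tests that the fix commit added.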

Samuel Colvin at Pydantic adds the discipline that keeps this honest under the pressure of the GenAI era. "We still want to build reliable, scalable applications," he notes, "and that is still hard — arguably harder with Gen AI than it was before." Human-seeded evals — examples a knowledgeable person labeled because they encode a real failure mode — are unusually valuable precisely because they carry that hard-won knowledge into a form the system can be tested against repeatedly. The seeding is the point. A human who has seen the system fail in a particular way writes the case that catches that failure forever after.

The cost is real: natural tasks are harder to score and harder to maintain than toy benchmarks. But treat that difficulty as a signal, not a deterrent — it is the trap of synthetic benchmarks that they stay cheap to score precisely because they have stopped resembling the work. The decision rule follows: when a benchmark is easy to grade and your eval set is passing comfortably, suspect that you are measuring the convenient unit rather than the real one, and go mine the next painful production failure instead. The more the system does genuine work, the less a synthetic eval can tell you about it.

Observability and evals are the same problem

Here the chapter's argument turns from offline measurement to the live system, and the cleanest statement of the turn comes from Phil Hetzel at Braintrust: "Observability and eval, to us, are actually the same problem from a systems perspective." That sentence is worth sitting with, because it dissolves a distinction most teams treat as fixed.

The usual mental model keeps them apart: evals are the offline thing you run before shipping, observability is the production thing you watch after. Hetzel's claim is that they are one loop. Production traces are not merely debugging artifacts you inspect when something breaks. They are the raw material for tomorrow's regression set. Every real interaction the system has (every success, every failure, every weird edge case a user actually hit) is a candidate eval case, and the strongest teams close that loop deliberately: traces feed failure analysis, failure analysis feeds the eval set, the eval set steers the next version, and the next version is watched in production again. Observability is not downstream of evals. It is where the next generation of evals is born.
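One step of that loop, promoting failed production traces into regression cases, might look like the sketch below. Everything here is an assumption made for illustration: the `Trace` fields, the `failed` flag (standing in for whatever failure analysis produces), and the choice to deduplicate on normalized input text. It is not Braintrust's data model.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    trace_id: str
    user_input: str
    output: str
    failed: bool            # assumed label from failure analysis or user feedback
    failure_note: str = ""


def case_key(user_input: str) -> str:
    """Stable key so the same question is not added to the set twice."""
    normalized = " ".join(user_input.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def promote_failures(traces: list[Trace], eval_set: dict[str, dict]) -> list[str]:
    """Add failed traces to eval_set as regression cases; return the new keys."""
    added = []
    for trace in traces:
        if not trace.failed:
            continue
        key = case_key(trace.user_input)
        if key in eval_set:
            continue
        eval_set[key] = {
            "input": trace.user_input,
            "bad_output": trace.output,
            "note": trace.failure_note,
            "source_trace": trace.trace_id,  # keeps the case linked to its trace
        }
        added.append(key)
    return added
```

Keeping `source_trace` on each case is the design choice that matters: when the case later fails or passes, a reviewer can go back to the production interaction that created it.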

This is also why Hetzel insists that "an eval platform is not just a test runner." A test runner executes assertions and reports pass/fail. An eval platform has to hold datasets, persist results across versions, support comparison workflows, render traces beside scores, and produce scoring credible enough to act on. Treat that list as a buy-or-build checklist: if your tool cannot persist results across versions and put a trace next to its score, it is a test runner wearing an eval platform's name, and it will not catch drift. The infrastructure is not incidental to the discipline. It is the discipline, made operable. A team that treats evals as a script they run by hand will measure once, feel reassured, and miss the drift that the loop was supposed to catch.
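The difference between running a script once and persisting results across versions fits in a short sketch. The names `save_run`, `load_run`, and `find_regressions`, the one-JSON-file-per-version layout, and the `tolerance` default are assumptions chosen for illustration, not a description of any particular platform.

```python
import json
from pathlib import Path


def save_run(results_dir: Path, version: str, scores: dict[str, float]) -> Path:
    """Persist one version's per-case scores so later runs can be compared."""
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / f"{version}.json"
    path.write_text(json.dumps(scores, indent=2, sort_keys=True))
    return path


def load_run(results_dir: Path, version: str) -> dict[str, float]:
    return json.loads((results_dir / f"{version}.json").read_text())


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, tuple[float, float]]:
    """Return cases whose score dropped by more than tolerance.

    Cases missing from the candidate run are skipped here; a stricter
    policy could treat a missing case as a regression.
    """
    regressions = {}
    for case_id, old_score in baseline.items():
        new_score = candidate.get(case_id)
        if new_score is None:
            continue
        if old_score - new_score > tolerance:
            regressions[case_id] = (old_score, new_score)
    return regressions
```

A test runner would report only the candidate's scores. The comparison against a stored baseline is what surfaces drift, and a fuller version would also store a trace identifier beside each score.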

When the agents read the evals too

There is a final turn that the rest of this book has been setting up, and the corpus has exactly one strongly-argued account of it. Once coding agents are doing most of the implementation work, the eval system stops being something only humans read. It becomes an artifact the agents themselves must navigate and modify.

Lawrence Jones at incident.io describes building this the hard way. The team stored evals as YAML next to their Go prompt files, and then watched the natural instinct — wrap the evals in richer and richer browser UIs — fail twice over. Humans liked the dashboards but did not have time to use them, and the coding agents could not navigate them at all: "coding agents weren't able to work with them." The unlock was not a better UI. It was a small CLI — "a small CLI tool that we call eval tool, designed to allow agents to leverage our eval suite files." The eval suite became an interface an agent fleet could plug into, rather than a destination a human had to visit.
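To make the shape of such a tool concrete, here is a hypothetical command-line interface over an eval suite file. It is not incident.io's eval tool: the `evalcli` name, the `list` and `add` subcommands, the flags, and the JSON file format are all invented for this sketch (the team's own suites are YAML, swapped for JSON here to stay within the Python standard library).

```python
import argparse
import json
import sys
from pathlib import Path


def load_suite(path: Path) -> list[dict]:
    """Read the suite file; a missing file is treated as an empty suite."""
    return json.loads(path.read_text()) if path.exists() else []


def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(prog="evalcli")
    parser.add_argument("--suite", type=Path, required=True)
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list", help="print one case per line")
    add = sub.add_parser("add", help="append a new case")
    add.add_argument("--id", required=True)
    add.add_argument("--input", required=True)
    add.add_argument("--expect", required=True)
    args = parser.parse_args(argv)

    cases = load_suite(args.suite)
    if args.command == "list":
        # Plain tab-separated text is easy for an agent to read and grep.
        for case in cases:
            print(f'{case["id"]}\t{case["input"]}')
        return 0

    # Otherwise the command is "add": refuse duplicates, then append.
    if any(case["id"] == args.id for case in cases):
        print(f"case {args.id} already exists", file=sys.stderr)
        return 1
    cases.append({"id": args.id, "input": args.input, "expect": args.expect})
    args.suite.write_text(json.dumps(cases, indent=2) + "\n")
    return 0
```

Wiring `main(sys.argv[1:])` to a script entry point gives an agent two things a dashboard cannot: text output it can parse, and an exit code it can branch on.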

The same inversion solved their observability problem. incident.io had built rich web UIs to debug AI traces; the agents, again, could not use them. So instead of wrapping the trace database in a fancier front end, they dumped the whole thing as a file tree — because, as Jones puts it, "file systems are exceptionally good agent context." Then they pushed it further: their "scrapbook" pipeline downloads every backtest investigation as a file system and runs roughly twenty-five agents in parallel, one per investigation, clustering the analyses into cohort patterns. The output is not a number on a dashboard. It is a structured improvement report — agents evaluating agent output, with the human receiving a diff instead of a chart. Jones is careful to generalize: "these patterns do generalize" beyond incident response.
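The file-tree idea is simple enough to sketch. The layout below (one directory per trace holding `input.txt`, `output.txt`, and `meta.json`) and the `dump_traces_as_tree` helper are assumptions for illustration, not the structure incident.io uses. The sketch also assumes each trace `id` is safe to use as a directory name.

```python
import json
from pathlib import Path


def dump_traces_as_tree(traces: list[dict], root: Path) -> list[Path]:
    """Write each trace to its own directory so an agent can ls, cat, and grep it."""
    written = []
    for trace in traces:
        trace_dir = root / trace["id"]
        trace_dir.mkdir(parents=True, exist_ok=True)
        (trace_dir / "input.txt").write_text(trace["input"])
        (trace_dir / "output.txt").write_text(trace["output"])
        # Everything that is not raw text goes into one small metadata file.
        meta = {k: v for k, v in trace.items() if k not in ("input", "output")}
        (trace_dir / "meta.json").write_text(json.dumps(meta, indent=2, sort_keys=True))
        written.append(trace_dir)
    return written
```

One directory per investigation also makes the parallel step natural: each agent can be pointed at a single path and needs no query language to find its context.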

The chapter holds this conservatively — it rests on one talk, however well argued — but it is the natural endpoint of the control-system frame. When the thing being steered can also read the steering instrument, the eval system becomes part of the agent's own loop. The measurement and the work begin to share a substrate.

Why this is the operating system of trust

Pull the threads together and the chapter's argument resolves. Evals are not a quality-assurance ritual performed before launch and forgotten. They are how a delegated system earns and keeps trust over time. They externalize judgment, turning fuzzy standards like good, safe, and useful into examples, rubrics, and thresholds a system can be held to. They scale to the real unit of work instead of the convenient one. They draw their cases from what actually hurt rather than what looked clever. They close a loop with production rather than running once in a lab. And, increasingly, they become an interface the agents themselves participate in.

This connects directly forward. Chapter 5 will argue that context is the infrastructure determining what the model can even see — and an eval is how you find out whether your context assembly earns its tokens. Chapter 6 will argue that runtimes carry the work across time — and an eval is how you know the long-running system still behaves after the tenth resume. Chapter 9 will make eval and review capacity the throughput limit of an entire organization. In every case the eval is the instrument that converts a hope about the system into evidence about it.

The book's recurring claim is that reliability comes from the scaffolding around the model, not from the model's cleverness. Evals are the part of that scaffolding that tells you whether the rest of it is working. Without them, every other discipline in this book is a guess you have decided to believe. With them, it becomes something you can measure, steer, and trust.

What to do with this

  • Match the eval's unit to the unit of work. If the deliverable is a multi-file change or a multi-step workflow, stop grading the diff line by line and score the completed task at the level it operates — grading a snippet that no longer exists tells you almost nothing.
  • Mine your operational history instead of authoring from scratch. Crawl your commit history for fixes, revert each one, hand the agent the broken state, and score whether it reaches the known-good commit — the way Jain's team builds at Cursor. Your escalation logs, incident tickets, and bug-fix commits are an eval set the failures already wrote for you.
  • Seed evals from real failures, not clever puzzles. When someone who knows the system watches it fail in a particular way, capture that case so it catches that failure forever after — that human-seeded knowledge is the part a synthetic benchmark cannot give you.
  • Treat a comfortably-passing synthetic benchmark as a warning. When the set is cheap to grade and passing easily, suspect you are measuring the convenient unit, and go mine the next painful production failure instead.
  • Close the loop between observability and evals. Treat production traces as the raw material for tomorrow's regression set: route traces into failure analysis, failure analysis into the eval set, the eval set into the next version, then watch that version in production again.
  • Audit your tooling against the platform bar. If your eval tool cannot persist results across versions and render a trace beside its score, it is a test runner — it will let you measure once and miss the drift. Also check that an agent, not just a human dashboard, can read and modify the eval suite: a small CLI over a file tree beats a rich browser UI agents cannot navigate.

10 claims · 38 source anchors

Evidence — Source Anchors

Reliability comes less from model cleverness than from surrounding scaffolding

  • The important thing is not the code but the prompt and the guardrails that got you there.
    #16 · Ryan Lopopolo, OpenAI · confidence: high
  • Agents have intelligence and capabilities, but not always expertise that we need for real work.
    #83 · Barry Zhang & Mahesh Murag, Anthropic · confidence: high
  • these are three kind of like ingredients which are pretty simple and pretty basic, but I think provide an interesting kind of like first principles approach for how to think about
    #198 · Harrison Chase, LangChain/LangGraph · confidence: high

Harness quality is a major determinant of coding-agent quality

  • a good harness is really operationalized around giving the model text at the right time
    #16 · Ryan Lopopolo, OpenAI · confidence: high
  • there's so much work that has been put in uh over the last you know 20 to 30 years around the automated validation and verification of software that you build
    #57 · Eno Reyes, Factory AI · confidence: high
  • instead of micromanaging, what I'm doing is I'm scaffolding and providing context.
    #190 · Eric Hou, Augment Code · confidence: high
  • identifying problems with the code because if there's no problems then it's probably high quality code
    #179 · Josh Albrecht, Imbue · confidence: high

Specs are not paperwork; they are executable intent

  • specs are natural language, you're using specs as a control surface to explain what you want the system to do.
    #40 · Al Harris, Amazon Kiro · confidence: high
  • leaving breadcrumbs, documentation, ADRs, persona oriented documentation around what a good job looks like.
    #16 · Ryan Lopopolo, OpenAI · confidence: high

The practical unit of AI coding is the codebase, not the snippet

  • snippets and my last project was generating an entire codebase.
    #72 · Naman Jain, Cursor · confidence: high
  • agents MD files an open standard
    #57 · Eno Reyes, Factory AI · confidence: high
  • codebase for harness engineering
    #16 · Ryan Lopopolo, OpenAI · confidence: high

Evals are a control system, not just a test suite

  • improvement without measurement is limited and imprecise.
    #125 · Ido Pesok, Vercel v0 · confidence: high
  • We still want to build reliable scalable applications and that is still hard
    #184 · Samuel Colvin, Pydantic · confidence: high
  • eval to us it's actually the same problem from a from a systems perspective.
    #628 · Phil Hetzel, Braintrust · confidence: high
  • small CLI tool that we call eval tool
    #689 · Lawrence Jones, incident.io · confidence: high
  • designed to allow agents to leverage our eval suite files.
    #689 · Lawrence Jones, incident.io · confidence: high
  • classic benchmark maxing.
    #746 · Ara Khan, Cline · confidence: high
  • There are right ways to use them. There are wrong ways to use them.
    #746 · Ara Khan, Cline · confidence: high

Realistic evals must be grounded in natural tasks and operational history

  • task should be natural and sourced from the real world and then you should be able to reliably grade them.
    #72 · Naman Jain, Cursor · confidence: high
  • If you build your application in a type safe way, if you use frameworks that allow it to be type safe, you can refactor it with confidence much more quickly.
    #184 · Samuel Colvin, Pydantic · confidence: high
  • Dynamic data sets have real world alignment.
    #153 · Quotient AI + Tavily · confidence: high

Evals are strongest when they are trace-linked and fed by production observability

  • what is the gap between agent observability and what you're actually building. How do we mind that gap?
    #680 · Amy Boyd & Nitya Narasimhan, Microsoft · confidence: high
  • we go from like a testing and eval paradigm to a monitoring p uh paradigm.
    #655 · Danny Gollapalli & Ben Hylak, Raindrop · confidence: high
  • where I've got some big production CI stack to go and run and deployment takes hours, being able to go and change variables in production or in staging very quickly
    #657 · Samuel Colvin, Pydantic · confidence: high
  • download all of the UI that we have as a file system?
    #689 · Lawrence Jones, incident.io · confidence: high
  • 25 agents in parallel
    #689 · Lawrence Jones, incident.io · confidence: high
  • it's actually the telemetry that does that.
    #750 · Dat Ngo, Arize · confidence: high

Activity-based metrics misread motion as progress in AI-augmented work

  • these are not productivity metrics. They're useful, but you cannot just kind of use them like maximize them to maximize developer productivity.
    #79 · Yegor Denisov-Blanch, Stanford (120k devs study) · confidence: high
  • I do think that AI increases developer productivity, but there's also cases in which it decreases developer productivity.
    #195 · Yegor Denisov-Blanch, Stanford (100k devs study) · confidence: high
  • just plain old PR throughput. How many pull requests does the average engineer merge per week?
    #101 · Nick Arcolano, Jellyfish (20M PRs) · confidence: high
  • I'm going to talk about how we pay engineers. And we pay engineers like salespeople.
    #63 · Arman Hezarkhani, Tenex · confidence: high

Problem framing and review become the scarce skills once execution is cheap

  • the new scarce skill is writing specifications that fully capture the intent
    #265 · Sean Grove, OpenAI · confidence: high
  • intentionally designed to put friction
    #14 · Armin Ronacher & Cristina Poncela Cubeiro · confidence: high
  • vibes aren't going to fix
    #132 · Chris Kelly, Augment Code · confidence: high
  • I'm declaring war on slop today.
    #59 · swyx · confidence: high

The best evals encode judgment mined from operational history, not invented in a clean room

  • take a real codebase, crawl its commit history, find the commits that fixed actual problems, and turn each fix into a graded task
    #60 · Govind Jain, Stripe · confidence: high
  • handle state potentially over long periods of time. There needs to be human interaction for approvals
    #167 · Preeti Somal, Temporal · confidence: high