AI PRESS / AI ENGINEER KB © 2026

From Copilot to Colleague

AI Engineer Knowledge Base
[ 757 source videos mapped ]
[ 2h read ]
Figure 01.0 (opener): Assistant vs delegate

Chapter 01 · Drafting · 1,922 words · 9 min read

Chapter 1 — The Shift: From Assistant to Delegate

The most important change in applied AI is not that chat got better. It is that people stopped wanting an answer and started wanting the work done.

For a few years the dominant experience of AI was answer production. You asked, and the system summarized, explained, drafted, brainstormed — with a fluency that mattered. But it was an interface story. A smarter text box. A faster first pass. A more conversational way to reach the software you already had. The verb was always tell me.

The sharper change begins when the verb becomes go do. Research this topic. Draft the contract. Refactor the service. Triage the queue. Investigate the failure. Come back with something a person can actually review and use. The moment the request crosses from "tell me" to "go do," the standard for success changes underneath it — and most of this book lives in the gap that opens up.

Three words for three different relationships

Figure 01.1: Assistant, copilot, delegate

It helps to be precise about what is shifting, because the words get used loosely. There are three relationships a person can have with an AI system, and they are not the same thing scaled up.

An assistant suggests. It produces something you then decide what to do with — a draft, an answer, an option. The human stays in the critical path for every step, and the assistant's job is to make that step faster.

A copilot collaborates inside a human loop. It works alongside you in real time, and the canonical case is the coding copilot completing the line you were already typing. The human is still flying the plane; the copilot is making continuous small contributions to a task the human is actively driving.

A delegate is assigned work and expected to return with it done. Not a suggestion to evaluate, but an artifact, a recommendation, or a completed step. The human steps out of the moment-to-moment loop and re-enters at review. That single move — out of the loop, back at the end — is the whole shift, and it changes what the system has to be.

This book's claim is not that assistants and copilots disappear. They remain useful, and for many tasks they are the right tool. The claim is that the engineering difficulty, the product ambition, and the organizational upheaval have all migrated to the third category. Delegation is where the hard problems are, because a delegate is no longer just saying things. It is shaping work that someone else will rely on.

Why delegation changes the failure surface

Figure 01.2: Tell me vs go do

When an assistant is wrong, the cost is bounded by the fact that a human is reading every word it produces. The error surfaces immediately, in context, with the person who asked still holding full attention. Suggestion is safe partly because it is supervised by construction.

Delegation removes that built-in supervision. The human is not watching each step; they are waiting for a result. So an error no longer surfaces when it happens — it surfaces later, downstream, possibly after it has been built on. A wrong suggestion is a wrong sentence. A wrong delegated action is a wrong artifact that other work now depends on. The failure surface stretches from a single response across an entire workflow, and that stretch is exactly what makes delegation an engineering problem rather than a UX one.

This is why so many practitioners describe the same wall. Jacob Lauritzen, building legal AI at Legora, puts it plainly: vertical AI and complex agents "need more than just the chat." The chat was never the hard part. The hard part is everything that has to exist around it before the work it produces can be trusted without someone watching it happen.

And the gap is not closed by a smarter model alone. Barry Zhang and Mahesh Murag at Anthropic name the distinction carefully: agents "have intelligence and capabilities, but not always expertise that we need for real work." Capability and expertise are different things. A model can be brilliant at the abstract task and still lack the specific context, the conventions, the institutional knowledge that separate a plausible result from a usable one. Intelligence is necessary. It has turned out not to be sufficient.

The chat box was always the smallest part

Figure 01.3: Chat is the tip of the iceberg

Once a system is doing real work rather than answering questions, the conversational surface becomes only the visible tip of it. The actual product is the apparatus underneath, and it is worth treating that apparatus as a checklist when you scope a delegation feature: context assembly that decides what the system knows, tool access that decides what it can do, workflow structure that holds a multi-step task together, quality checks that catch errors before a human does, durable state that survives interruption, review layers where a person re-enters, and observability so that person can see what happened. If a delegated workflow is missing one of these, that is where it will fail first — and the chat box is the one part you can take for granted.

This is why "agent" products keep sprouting the same organs. They grow trace views, side panels, memory layers, approval queues, workflow diagrams. None of that is decoration: these organs are the reliability work becoming visible, so read them as a diagnostic. When a serious agent product lacks a trace view or an approval queue, treat the absence as a reliability gap, not a leaner design. It means the trust those organs externalize is still implicit, riding on a human watching. Joel Hron at Thomson Reuters describes the target as systems that don’t just suggest but plan their own work, execute it, and replan as they learn. Every word past suggest in that sentence is a new engineering surface, and the rest of this book is largely a tour of them.
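One way to make the apparatus checklist above concrete is to record it as data and ask which parts a planned feature has not yet accounted for. The following is a minimal Python sketch, not a prescribed implementation: the class name `DelegationApparatus`, the function `missing_components`, and the boolean-flag representation are assumptions made for illustration. Only the seven component names come from the text.

```python
from dataclasses import dataclass, fields


@dataclass
class DelegationApparatus:
    """One flag per component of the checklist; names follow the text.

    The boolean representation is an illustrative assumption.
    """
    context_assembly: bool = False    # decides what the system knows
    tool_access: bool = False         # decides what it can do
    workflow_structure: bool = False  # holds a multi-step task together
    quality_checks: bool = False      # catch errors before a human does
    durable_state: bool = False       # survives interruption
    review_layers: bool = False       # where a person re-enters
    observability: bool = False       # lets that person see what happened


def missing_components(apparatus: DelegationApparatus) -> list[str]:
    """Return the unaccounted-for components, in checklist order."""
    return [f.name for f in fields(apparatus) if not getattr(apparatus, f.name)]


# Hypothetical feature that so far has only the first three pieces.
draft_feature = DelegationApparatus(
    context_assembly=True,
    tool_access=True,
    workflow_structure=True,
)
print(missing_components(draft_feature))
```

Anything this sketch reports as missing is, by the chapter's argument, a candidate for where the delegated workflow will fail first.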

The evidence for this shift is not a controlled study proving that broad delegation already works well everywhere. It does not yet. What the corpus shows is something more specific and, in its way, more credible: a strong convergence among serious builders about what they are trying to make these systems do, and a remarkably consistent account of where it gets hard. This book reports that convergence and the engineering it is producing. It does not claim the problem is solved.

Two cases this book keeps returning to

Figure 01.4: The failure surface stretches

The two cases stress the same idea from opposite directions.

The first is the software factory. A coding agent looks magical on a small, self-contained task and then degrades on larger ones. The trap is to read that degradation as a model ceiling and wait for a better model; the move practitioners keep making instead is to improve the workplace around the agent: the repository, the harness, the specs, the evals, and the runtime. When delegated work falls apart at scale, fix the environment before you blame the model. Raw capability is not enough for dependable delegated work, and the workplace itself has to be made legible to the agent. Chapter 3 takes this up directly.

The second is the high-stakes colleague. A legal, tax, compliance, or enterprise-research assistant begins as a helpful conversational surface and then gets asked to do work with professional consequences. At that point provenance, access boundaries, retrieval discipline, durable trajectories, and explicit review points stop being optional. Fluency is still useful, but it is no longer the thing being bought. Chapters 5 and 7 return here.

The two cases are deliberately unlike each other. One lives in software, where artifacts are executable and testable and a failing test is an honest signal. The other lives in knowledge work, where authority, evidence, and institutional accountability carry the weight and the failure is a wrong judgment rather than a red build. They are different enough that their agreement means something. Both arrive at the same conclusion: once the task becomes delegated work, intelligence alone is not enough.

What this opening sets up

That conclusion hands directly to the next chapter. If delegation makes execution cheap, it does not make judgment cheap — it makes judgment scarce, and therefore more valuable. Taste, standards, and review quality stop being background virtues and become the constraints that keep cheap output from turning into expensive mess. Chapter 2 is the human half of this opening argument, and it has to come before the technical core, because the technical core only matters in service of judgment that someone still has to supply.

From there the book asks its real question: what has to surround model intelligence before a team can trust it with delegated work? The answer is a stack of enabling conditions, and the chapters are that stack: stronger human standards (Chapter 2), legible harnesses (Chapter 3), evals as a control system (Chapter 4), context as infrastructure (Chapter 5), durable runtimes and human control (Chapter 6), security and bounded authority (Chapter 7), the realtime stress test (Chapter 8), the organization that holds it all (Chapter 9), and finally what survives when the tools turn over (Chapter 10).

So the important fact about modern AI is not that it can talk. It is that people increasingly want it to work: not merely to generate ideas but to return artifacts, not merely to answer but to complete bounded steps, not merely to sound plausible but to produce work a human can inspect, redirect, and trust. That demand is what turns model progress into an engineering discipline. The shift from assistant to delegate is where this book begins because it is where the engineering begins.

What to do with this

  • Classify each AI feature you ship as assistant, copilot, or delegate before you build it. Assistant and copilot keep a human in the critical path every step; only a delegate has the human step out of the loop and re-enter at review. If you are building a delegate, expect the engineering, product, and organizational difficulty to be categorically higher, because that is where the hard problems are.
  • When you cross a feature from "tell me" to "go do," map where errors will surface. A wrong suggestion is a wrong sentence caught in context; a wrong delegated action is a wrong artifact that later work depends on. Before delegating, decide how that downstream error gets caught, because the built-in supervision of someone reading every word is gone.
  • Stop expecting a smarter model alone to close the gap. Per Anthropic's Barry Zhang and Mahesh Murag, agents "have intelligence and capabilities, but not always expertise that we need for real work," so budget explicitly for the missing expertise: the specific context, conventions, and institutional knowledge that turn a plausible result into a usable one.
  • Treat the apparatus as a build checklist, not the chat box. For any delegated workflow, account for context assembly, tool access, workflow structure, quality checks, durable state, review layers, and observability — and when a serious agent product is missing one of these organs (a trace view, an approval queue), read the absence as a reliability gap rather than a cleaner design.
  • When a coding agent degrades on larger tasks, fix the environment before blaming the model. Improve the repository, harness, specs, evals, and runtime so the workplace is legible to the agent, and apply the same lesson to high-stakes knowledge work, where provenance, access boundaries, retrieval discipline, durable trajectories, and explicit review points stop being optional once the output carries professional consequences.
EVIDENCE OF SOURCE · CHAPTER 01 · VIDEOS

4 claims · 12 source anchors

Evidence — Source Anchors

The important transition is from suggestion to delegated execution

  • from helpfulness to productive
    #206 · Joel Hron, Thomson Reuters · confidence: high
  • I think they need more
    #3 · Jacob Lauritzen, Legora · confidence: high
  • most primitives the magic happens when you combine these things together
    #138 · Sam Bhagwat, Mastra.ai · confidence: high

Chat is an insufficient control surface for long-running or high-stakes work

  • Chat is one-dimensional. It's a very low bandwidth interface,
    #3 · Jacob Lauritzen, Legora · confidence: high
  • we're asking AI systems to now produce output and produce judgments and decisions
    #206 · Joel Hron, Thomson Reuters · confidence: high
  • handle state potentially over long periods of time. There needs to be human interaction for approvals
    #167 · Preeti Somal, Temporal · confidence: high

Reliability comes less from model cleverness than from surrounding scaffolding

  • The important thing is not the code but the prompt and the guardrails that got you there.
    #16 · Ryan Lopopolo, OpenAI · confidence: high
  • Agents have intelligence and capabilities, but not always expertise that we need for real work.
    #83 · Barry Zhang & Mahesh Murag, Anthropic · confidence: high
  • these are three kind of like ingredients which are pretty simple and pretty basic, but I think provide an interesting kind of like first principles approach for how to think about
    #198 · Harrison Chase, LangChain/LangGraph · confidence: high

Human oversight works best as an architectural layer, not an afterthought

  • There needs to be human interaction for approvals or other reasons and of course they need to be able to be uh able to run in parallel for efficiency
    #167 · Preeti Somal, Temporal · confidence: high
  • dial these agency dials far up.
    #206 · Joel Hron, Thomson Reuters · confidence: high
  • maintaining a factory would require you to have an overview of the processes you want your coding agents to go through.
    #629 · Eric Zakariasson, Cursor · confidence: high
AI QUALITY · CHAPTER 01 · MASH JUDGES

scored on version git:2f2668c

Figure 02.0 (opener): Vibe coding vs vibe engineering

Chapter 02 · Drafting · 1,996 words · 9 min read

Chapter 2 — Taste Still Matters When Code Gets Cheap

Cheap code is easy to misread. When a model can generate a feature, a test suite, and the glue between them in the time it takes to describe them, the obvious conclusion is that engineering itself has gotten cheap. Part of it has. Routine production — the typing, the boilerplate, the third CRUD endpoint that looks like the first two — is easier than it was.

But cheap output is not the same as cheap judgment, and the gap between those two is the subject of this chapter. If Chapter 1 argued that the real shift is from suggestion to delegated work, the immediate human question is what remains scarce once the output is abundant. The answer is taste — and abundance does not retire taste. It promotes it.

The argument in three lines

Figure 02.1: The cheap-code syllogism

The chapter's claim is almost arithmetic:

  • Generation is getting cheaper.
  • Bad decisions are not getting cheaper.
  • Therefore taste, standards, and review matter more, not less.

A machine can accelerate production without removing the need for discrimination. In fact it sharpens that need, because it widens the gap between what can be produced and what should be. When a team could only build a little, the act of building was itself a filter — you thought hard about what was worth the effort. Remove the cost of building and you remove the filter. Something has to replace it, and that something is judgment.

Matt Pocock, whose work on software fundamentals runs directly counter to the automation-triumphalist mood, states the consequence bluntly: "Software fundamentals matter now more than they actually ever have." That sounds paradoxical until you see the mechanism. Fundamentals are how you tell good output from convincing output, and a world flooded with convincing output needs that skill more, not less. The model raised the volume of plausible work. It did not raise anyone's ability to evaluate it — that still has to come from a person who knows what good looks like.

What "taste" actually means here

Figure 02.2: The vibe-coding mode switch

Taste, in this chapter, is not aesthetics or personal preference. It is a specific discriminative capacity: the ability to notice the difference between things that look the same on the surface and are not the same underneath.

Between output that merely works and output that actually fits the system it lives in. Between a fast prototype and a maintainable design. Between a plausible draft and a trustworthy one. Between local correctness — this function is right — and broader coherence — this function belongs here, in this shape, for this reason.

Those distinctions were always part of senior engineering judgment. What changes under AI is that they become the binding skill rather than a background one, because the system can now produce more than a human can casually validate. When a person wrote every line, reviewing it was partly automatic — you understood it because you made it. When a model writes it, that understanding has to be supplied deliberately, by someone who can look at fluent, working output and still ask whether it is the right output. Tuomas Artman, reflecting with Gergely Orosz on craft at Linear, frames the open question precisely: "What happens when agents are capable of doing everything immediately for you?" The answer this chapter gives is that the scarce contribution moves from making to discerning.

Vibe coding: powerful, and dangerous in the same breath

Figure 02.3: Friction is judgment

The clearest live example of the taste problem is the practice that has come to be called vibe coding: building by rapid, loosely specified prompting, steering on feel rather than on a plan. The corpus is split on it, and the split is instructive rather than confused.

Vibe coding is excellent for exploration. For rough prototypes, interface sketches, internal tools, one-off automation, and the early work of figuring out what is even worth building, it is a remarkable accelerant. Steering on feel is the right mode when the goal is discovery, because in discovery you do not yet know enough to specify, and premature rigor would just slow the learning.

It becomes dangerous at one specific moment: when an exploratory mode quietly hardens into a production philosophy. The problem is not speed. The problem is shipping output that no one fully understands, can maintain, or can confidently review — output that works today and becomes a liability the first time it has to change. swyx, who runs the AI Engineer community, captures the reaction this provokes among engineers who have to live with the aftermath: "I'm declaring war on slop today." Slop is the precise failure mode here — work that looks finished and transfers its real cost downstream to whoever has to understand it next.

So the right conclusion is not "never vibe code." It is a mode switch with a concrete test: vibe coding is the wrong default the moment the output has to be maintained, reviewed, or changed by someone other than the person who prompted it. Use it freely for the discovery cases named above: prototypes, interface sketches, internal tools, one-off automation, figuring out what is worth building. The skill is not picking a side. It is knowing which mode you are in, and noticing when you have drifted from one into the other without deciding to, typically when an exploratory prototype quietly starts being treated as the thing you ship.

The friction that was load-bearing

AI removes a great deal of wasteful friction, and that is good. Nobody should mourn the boilerplate. But not all friction is waste, and the dangerous move is to treat every pause as inefficiency to be optimized away.

Some friction is where judgment was happening. The review before the merge. The extra question about whether the architecture can carry this. The refusal to accept a generic draft that technically satisfies the request. The decision to rewrite something correct-but-wrong: code that passes every test and still encodes the wrong model of the problem. Those pauses look like drag on a velocity dashboard. They are often the exact points where quality was being created, and removing them does not make the work faster so much as make the absence of judgment invisible until later. Armin Ronacher, in a talk with Cristina Poncela Cubeiro titled The Friction Is Your Judgment, names this directly: the pauses worth keeping are "intentionally designed to put friction" back into engineering workflows, because the alternative is to optimize away the place where the judgment lives.

This is the trap that makes cheap execution treacherous rather than simply beneficial: a badly framed task handed to a strong model wastes more time than it used to, because the system will sprint confidently in the wrong direction and produce a great deal of polished output before anyone notices it was the wrong direction. The practical implication is to invest the friction up front, before the model runs, rather than removing it: the cost of a wrong frame now scales with how fast the model can execute it. Speed without judgment is not neutral. It manufactures expensive mistakes efficiently.

The new scarce skills: framing and review

Two specific capacities become the bottleneck once execution is cheap.

The first is problem framing. When the model will faithfully execute whatever it is pointed at, pointing it becomes the high-leverage act. Before handing a task to a model, answer five questions explicitly: What is the actual task? What counts as success? Which constraints are real and which are habit? What is allowed to stay rough? What would make the result unacceptable even if it looked finished? Treat that last question as a gate: if you cannot state what would make fluent, working output unacceptable, you are not ready to delegate the task. A team that can answer those questions sharply gets a force multiplier; a team that cannot gets fluent output aimed at the wrong target. Framing is no longer the cheap part of the work before the real work starts. In a world of cheap execution, framing is the work.

The second is review as anti-slop discipline. Review stops being a bureaucratic gate and becomes the place where standards are actively defended against the pressure of fast, plausible output. This is the same argument the book will make about evals in Chapter 4 — that measurement is a control system, not a chore — arriving here first in its human form. Cheap output that creates expensive cleanup was never actually cheap; review is where a team refuses to let that hidden cost through. In an AI-native team, the reviewer is not slowing the work down. They are the reason the work can be trusted to go fast.
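The five framing questions above can be held as a small record that is filled in before any task is handed off. What follows is a rough Python sketch under stated assumptions: the `TaskFrame` class, its field names, and the sample task are hypothetical and chosen only for illustration. `ready_to_delegate` gates on the last question alone, mirroring the text; a stricter team might require all five.

```python
from dataclasses import dataclass


@dataclass
class TaskFrame:
    """Answers to the five framing questions; an empty string means unanswered."""
    actual_task: str = ""
    success_criteria: str = ""
    real_constraints: str = ""
    allowed_to_stay_rough: str = ""
    unacceptable_even_if_finished: str = ""

    def unanswered(self) -> list[str]:
        """Names of the questions that still have no answer, in question order."""
        return [name for name, answer in vars(self).items() if not answer.strip()]

    def ready_to_delegate(self) -> bool:
        """Gate on the last question only, as the text describes it."""
        return bool(self.unacceptable_even_if_finished.strip())


# Hypothetical task with only two of the five questions answered.
frame = TaskFrame(
    actual_task="Add pagination to the orders endpoint",
    success_criteria="Existing clients keep working and the new tests pass",
)
print(frame.unanswered())
print(frame.ready_to_delegate())
```

The point of writing it down this way is only that an unanswered gate question becomes visible before the model runs, rather than at review.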

The bridge: from judgment to harness

Once judgment becomes the scarce input, an obvious question follows, and it is the question that opens the technical core of the book: how do you encode that judgment so machines can work inside it?

Because taste cannot stay locked in a senior engineer's head, surfacing only at review time to reject what the model already spent effort building. That does not scale, and it wastes the very speed the model offered. The judgment has to be externalized: pushed upstream into the environment the agent works in, so the standards shape the output as it is produced rather than catching it afterward. Into harnesses, specs, repository structure, validation rules, and review systems. That externalization is exactly what Chapter 3 is about, and it is why this chapter sits where it does: the harness only makes sense once you accept that there is judgment worth encoding into it.

So when code gets cheaper, the real question was never whether humans matter less. It is which human contribution becomes scarce — and the answer is judgment under abundance: knowing what should be built, what standards apply, which tradeoffs are acceptable, and which seductive, working, plausible output needs to be rejected anyway. That is what taste means here. That is why it still matters. And that is why the rest of this book is, in a sense, an extended answer to the question of how to give judgment somewhere durable to live.

What to do with this

  • Before delegating a task to a model, answer the five framing questions out loud or in writing: what the actual task is, what counts as success, which constraints are real versus habitual, what is allowed to stay rough, and, as a gate, what would make the result unacceptable even if it looked finished. If you can't state that last one, you're not ready to delegate yet.
  • Decide which mode you're in before you start, and re-check when output is about to ship. Vibe coding is the right default for prototypes, interface sketches, internal tools, one-off automation, and figuring out what's worth building; it's the wrong default the moment output has to be maintained, reviewed, or changed by someone other than the person who prompted it.
  • Watch for the silent drift: an exploratory prototype quietly being treated as the thing you ship. Name that transition explicitly when it happens, and switch modes rather than letting a discovery artifact harden into a production philosophy by default.
  • When work feels too fast, audit the friction you removed. Some pauses — the review before merge, the question about whether the architecture can carry this, the refusal of a correct-but-wrong draft that passes every test — are where judgment was happening. Don't optimize those away just because they slow a velocity dashboard.
  • Invest framing effort up front rather than catching problems at review. The cost of a badly framed task now scales with how fast the model executes it, so a wrong frame produces more polished output in the wrong direction before anyone notices. Treat review as the place where standards are actively defended against fast, plausible output — not as a bureaucratic gate.
EVIDENCE OF SOURCE · CHAPTER 02 · VIDEOS

3 claims · 12 source anchors

Evidence — Source Anchors

Cheap generation raises the value of taste and judgment rather than lowering it

  • software fundamentals matter now more than they actually ever have.
    #1 · Matt Pocock, AI Hero · confidence: high
  • capable of doing everything um immediately
    #6 · Tuomas Artman & Gergely Orosz · confidence: high
  • intentionally designed to put friction
    #14 · Armin Ronacher & Cristina Poncela Cubeiro · confidence: high

Vibe coding is an exploration mode that fails as a production default

  • It's called vibe engineering.
    #73 · Kitze, Sizzy · confidence: high
  • I'm declaring war on slop today.
    #59 · swyx · confidence: high
  • vibes aren't going to fix
    #132 · Chris Kelly, Augment Code · confidence: high
  • The hangover is the resulting despair
    #106 · Corey J. Gallon, Rexmore · confidence: high
  • vibe coding with confidence
    #127 · Itamar Friedman, Qodo · confidence: high

Problem framing and review become the scarce skills once execution is cheap

  • the new scarce skill is writing specifications that fully capture the intent
    #265 · Sean Grove, OpenAI · confidence: high
  • intentionally designed to put friction
    #14 · Armin Ronacher & Cristina Poncela Cubeiro · confidence: high
  • vibes aren't going to fix
    #132 · Chris Kelly, Augment Code · confidence: high
  • I'm declaring war on slop today.
    #59 · swyx · confidence: high
AI QUALITY · CHAPTER 02 · MASH JUDGES

scored on version git:2f2668c

Figure 03.0 (opener): Bare prompt vs engineered harness

Chapter 03 · Drafting · 3,214 words · 14 min read

Chapter 3 — Harnesses, Specs, and Codebases Agents Can Actually Use

When coding agents disappoint, teams usually blame the model.

The model missed a dependency. The model misunderstood the architecture. The model refactored the right function but violated a local convention no outsider could have guessed. The model produced code that technically passed, yet somehow still felt wrong. In the postmortem, intelligence becomes the default suspect.

Sometimes that diagnosis is fair. Models do fail because they are weak, distracted, or simply not yet capable enough for the task. But in production codebases, that explanation is often too flattering to the humans involved. Many agent failures are not evidence that the model is hopeless. They are evidence that the environment was never made legible enough for delegated work.

Ryan Lopopolo puts the inversion bluntly: “The important thing is not the code but the prompt and the guardrails that got you there.” The line sounds almost rude in a field obsessed with generated output. But it captures one of the deepest shifts in AI-native engineering. Once you ask a machine to do implementation work instead of merely suggesting snippets, the surroundings become part of the product. The harness around the model starts determining what kind of work is even possible.

If you want AI to write production software, do not begin by asking what model to use. Begin by asking whether your repository, specs, validations, and workflow are structured well enough for a machine collaborator to operate without constant rescue.

A small software-factory vignette

Figure 03.1: The repo is the interface

Consider a team that starts where many teams now start: with a good model inside an ordinary repo.

At first the results feel magical. The agent writes tests faster than the humans expect. It handles small UI changes cleanly. It can even land a respectable refactor if a senior engineer hovers nearby and corrects its misunderstandings in real time. So the team expands the scope. They ask it to wire together a new endpoint, touch a migration, update a frontend state machine, and preserve some vague house style that nobody has ever written down.

Quality immediately gets erratic.

One patch uses a dependency the team would never approve. Another passes tests but ignores a performance convention learned the hard way six months earlier. A third is logically fine yet shaped in a way that makes review irritating and rollback risky. The team says the model is inconsistent. What they really mean is that the workplace is inconsistent.

So they change the workplace.

They add explicit setup scripts instead of Slack archaeology. They tighten lint and type gates. They create agent-facing instructions. They check in examples of accepted patterns. They write slimmer task specs before handing work off. They stop relying on “everyone kind of knows how we do migrations here.” The repo becomes less like a haunted archive of past decisions and more like a managed surface for machine labor.

That is the recurring case I will call the software factory. It matters because it shows where the leverage sits. The breakthrough is not that the model suddenly became a genius. The breakthrough is that the team stopped treating its own tacit judgment as invisible infrastructure.
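The workplace changes in the vignette can, in part, be checked mechanically. The sketch below is one possible Python audit that reports which agent-facing artifacts exist in a repository. Every path in `EXPECTED_ARTIFACTS` is a hypothetical placeholder (the text names the kinds of artifact, not where they live), and `audit_repo` is an illustrative helper, not part of any tool mentioned in the book. File presence is also only a weak proxy: it says nothing about whether the instructions or examples are any good.

```python
from pathlib import Path

# Hypothetical locations; each team would substitute its own layout.
EXPECTED_ARTIFACTS = {
    "setup script": "scripts/setup.sh",
    "lint and type gates": "ci/checks.yml",
    "agent-facing instructions": "AGENTS.md",
    "accepted-pattern examples": "docs/examples",
    "task specs": "specs",
    "migration conventions": "docs/migrations.md",
}


def audit_repo(root: Path) -> dict[str, bool]:
    """Map each expected artifact to whether it exists under root."""
    return {label: (root / rel).exists() for label, rel in EXPECTED_ARTIFACTS.items()}


if __name__ == "__main__":
    # Audit the current working directory and print one line per artifact.
    for label, present in audit_repo(Path(".")).items():
        status = "ok     " if present else "MISSING"
        print(f"{status}  {label}")
```

Run from a repository root, this prints one line per artifact. A missing entry is a prompt to write down something the team currently carries in memory.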

The repo is the real interface

Figure 03.2: Specs persist

Most coding tools still present themselves through chat. You type a request, maybe select a few files, and wait for the assistant to propose a patch. That interface is useful, but it can also be misleading. It suggests that the real problem lives in the prompt.

In practice, the prompt is only the visible tip of a much larger system.

The real interface is the codebase plus everything around it: the setup instructions, the architecture, the naming conventions, the tests, the lint rules, the examples of good patches, the traces of previous reviews, the ADRs, the failure cases, and the rules about performance or security that no single file states explicitly. A human engineer entering a mature repository absorbs these constraints slowly. They ask teammates what matters. They notice which patterns recur. They learn what kinds of changes get approved quickly and which ones trigger suspicion.

An agent does not get that apprenticeship unless the team builds it.

Lopopolo makes the obligation explicit: “Your job is to build systems, software and structures that enable your team to be successful. And to do that, we need to make them legible to those agents that are driving the implementation.” That sentence is more radical than it first appears. It means the team is no longer only maintaining software for other humans. It is also maintaining a working environment for machine contributors.

That is why documentation, ADRs, examples, and historical breadcrumbs matter so much. Those are not decorative artifacts around the “real” software process. In an AI-native workflow, they become part of the execution environment itself. If human expectations live only in scattered memory, the agent cannot inherit them. If the rules of good work are mostly tacit, the agent will violate them in ways that look mysterious only because the team never externalized its own standards.

This is also why the practical unit of AI coding is no longer the snippet. It is the codebase. Naman Jain describes the shift cleanly: “My first project was actually working on generating single line... snippets and my last project was generating an entire codebase.” Once the unit of work becomes the repository, environment design stops being background hygiene and becomes first-order leverage.

The core mistake many teams make is to treat code generation as the primary problem and repo legibility as a secondary concern. In reality, the second often dominates the first — so the diagnostic move when an agent disappoints is to invert the blame: before swapping models, ask what a new senior hire would have had to ask a teammate to do this task safely, then check whether that answer exists anywhere the agent can read it. A capable model dropped into a murky repository is like a strong engineer dropped into an organization with no onboarding, inconsistent standards, and no access to prior decisions. You can still get lucky. You cannot count on it.

Good code contains hundreds of unstated decisions

Figure 03.3 / The software factory

One reason this problem is easy to underestimate is that experienced engineers are bad at seeing their own tacit judgment. Good code does not differ from bad code only in correctness. It differs in tone, proportion, naming, performance discipline, dependency choices, rollback safety, test shape, compatibility assumptions, reviewability, and fit with the broader system.

Lopopolo gives this problem a memorable scale when he says that producing a single patch can require “500 little decisions” around underspecified non-functional requirements. The exact number is not the point. The point is that repositories are dense with decisions that matter greatly but are rarely captured in the task description. A human engineer often fills those gaps through craft and context. A coding agent fills them through inference under uncertainty.

That is where slop comes from.

The sloppy patch is not always the sign of a stupid model. It is often the sign of a task whose invisible success criteria were never written down. The agent guessed because the environment forced it to. Then humans act surprised that the guesses look generic.

Once you see the problem this way, the prescription changes. The answer is not only “prompt better.” It is to reduce the amount of silent guesswork that the environment demands. Externalize architecture choices. Store examples of accepted patterns. Make non-functional constraints explicit. Give the system stable ways to discover how this team expects software to be built.

That is what harness engineering means: not a fancier wrapper around a model, but the systematic conversion of tacit engineering judgment into durable, machine-usable constraints.

This is also where Chapter 2 should still be echoing in the reader’s mind. Cheap generation raised the value of judgment. Chapter 3 is where that judgment stops living only inside senior people and starts getting encoded into the environment.

Specs are not paperwork; they are executable intent

This is where spec-driven development becomes more than a documentation preference.

In a purely human workflow, specs often compete with direct conversation. A strong team can get away with more ambiguity because engineers resolve a surprising amount through meetings, hallway discussions, pull-request comments, and local intuition. In an AI-mediated workflow, that ambiguity becomes more expensive. The failure mode is relying on chat to carry intent: context windows expire, tasks get retried, work gets decomposed into subproblems, and different agents touch different layers of the same system — so any intent that lives only in transient conversation dissolves the moment one of those events fires. The operational test is whether a piece of intent would survive a fresh agent picking up the task cold; if it would not, it belongs in a persistent artifact, not the prompt.

Al Harris offers one of the clearest framings in the corpus: “The spec then becomes the natural language representation of your system. It has constraints, it has concerns around functional requirements, non-functional requirements...” That framing matters because it upgrades the spec from document to control surface.

Harris makes a second point that is just as important: spec-driven development is “a structured workflow that we push you through to reliably deliver high-quality software... requirements, design, and execution phases.” In other words, the spec is not a memo attached to the work. It is part of the workflow that shapes the work.

A useful spec in an AI-native environment does several jobs at once. It records what problem is being solved. It states constraints the agent should not have to rediscover through trial and error. It makes room for non-functional requirements that are otherwise easy to drop. It persists across retries and handoffs. And it creates a shared object that humans and machines can both inspect.

Seen this way, specs are a form of context compression. They take sprawling intent that would otherwise live in chat history, tribal knowledge, or remembered conversation and package it into a stable artifact the workflow can keep returning to. They also make evaluation easier, because a system with a concrete spec can be judged against a clearer notion of success than one that began with a vague prompt and hope.
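A minimal sketch of a spec as a persistent artifact rather than chat history. The `Spec` fields, the JSON format, and the file location are all assumptions made for illustration; the sources do not prescribe a schema. The point is only that a fresh agent can reload the same intent from disk.

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class Spec:
    """Hypothetical spec shape; fields follow the jobs listed in the text."""
    problem: str
    constraints: list[str] = field(default_factory=list)
    non_functional: list[str] = field(default_factory=list)
    acceptance: list[str] = field(default_factory=list)


def save_spec(spec: Spec, path: Path) -> None:
    """Persist the spec so it outlives any single chat session or retry."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(spec), indent=2), encoding="utf-8")


def load_spec(path: Path) -> Spec:
    """What a fresh agent would do when picking the task up cold."""
    return Spec(**json.loads(path.read_text(encoding="utf-8")))
```

Because the spec round-trips through a file, the same object can also serve as the reference that a later evaluation step compares the output against.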

This does not mean every ticket needs an elaborate design doc. Over-specification can absolutely collapse exploration into bureaucracy, and small or exploratory work is best discovered through fast iteration — writing a heavy spec there is the wrong choice. The test for when to reach for one: the moment the cost of misunderstanding rises — because the task is large, parallelized across agents, safety-sensitive, or expensive to review — write the spec before generating. Below that line, prompt and iterate; above it, externalize intent first, because that is where a stable representation of intent becomes leverage rather than ceremony.
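The decision rule above is simple enough to write down as code. This is a sketch only: the `Task` type and its four flags are hypothetical names that mirror the criteria in the paragraph, and a real team would likely weigh them rather than treat each as an on/off switch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor; the flags mirror the text's criteria."""
    large: bool = False
    parallelized: bool = False
    safety_sensitive: bool = False
    expensive_to_review: bool = False


def needs_spec_first(task: Task) -> bool:
    """True when any factor raises the cost of misunderstanding."""
    return any(
        (task.large, task.parallelized, task.safety_sensitive, task.expensive_to_review)
    )
```

A small exploratory task (`Task()`) falls below the line and stays in prompt-and-iterate mode; setting any one flag moves it above the line.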

The deeper point is that spec-driven development matters more, not less, in an era of powerful coding agents. The stronger the generator, the more valuable a stable representation of intent becomes.

Agent-ready codebases are designed, not discovered

Eno Reyes is especially useful here because he connects old-fashioned engineering hygiene to an AI-native operating model. He begins with a deliberately basic question: “Do you have some automated validation for the format of your code?... for professional software engineers [it's] like, yeah, of course we do.” Then comes the important turn: “But I think you can go a step further.”

That extra step is the real substance of agent-readiness. The question is not simply whether a team has linting or tests. It is whether the codebase has enough automated validation and explicit structure that a coding agent can move through it with bounded risk. A repository becomes agent-ready when it exposes enough of its standards, setup, and quality gates that delegated work becomes legible.

A practical checklist usually includes at least the following:

  • a stable folder structure rather than a maze of historical accidents
  • explicit setup, build, and run commands that do not rely on oral tradition
  • strong type, lint, and test gates the agent can run repeatedly
  • architecture decisions stored in files instead of buried in memory
  • examples of accepted patterns for tests, APIs, migrations, and reviews
  • specs or task briefs stored close enough to the work that they survive handoff
  • narrower tools or scripts for common operations where free-form shell access is unnecessary
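Part of the checklist above can be checked mechanically. The sketch below only tests for the presence of files, which is a weak proxy for quality, and every path in `CHECKS` (`scripts/setup.sh`, `docs/adr`, `examples`, `specs`) is a hypothetical convention chosen for illustration. Substitute whatever your repository actually uses.

```python
from pathlib import Path

# Hypothetical conventions: each checklist item maps to paths whose
# presence suggests the item is covered. Adjust to your own repo layout.
CHECKS = {
    "setup/build/run commands": ["scripts/setup.sh", "Makefile"],
    "agent-facing instructions": ["AGENTS.md"],
    "architecture decisions in files": ["docs/adr"],
    "examples of accepted patterns": ["examples"],
    "specs stored near the work": ["specs"],
}


def audit_repo(root: str) -> dict[str, bool]:
    """Return, per checklist item, whether any of its expected paths exists."""
    base = Path(root)
    return {
        item: any((base / rel).exists() for rel in paths)
        for item, paths in CHECKS.items()
    }


def missing_items(root: str) -> list[str]:
    """List checklist items with no matching path, in checklist order."""
    return [item for item, ok in audit_repo(root).items() if not ok]
```

Running `missing_items(".")` from a repo root gives a short to-do list; an empty list means the files exist, not that their contents are good.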

None of this is glamorous. That is exactly why it matters. The biggest gains in coding agents often come not from frontier prompting technique but from reducing avoidable ambiguity in the repository itself.

There is also an organizational benefit hiding inside this technical one. Better repositories do not only help machines. They help weaker humans, new hires, cross-functional contributors, and future maintainers. In that sense, “agent-ready” is not some alien new standard imposed by AI. It is a sharper test of whether the team actually encoded its own expectations in reusable form.

The agent is exposing the difference between standards the team possesses and standards the team can operationalize.

The harness is a workflow, not just a wrapper

It is tempting to imagine the harness as a thin layer around a model: maybe a system prompt, a tool list, a sandbox, and a few guardrails. That is too narrow.

A real harness includes environment setup, repository policy, validation steps, task decomposition, review surfaces, memory of prior work, failure handling, and the sequence in which all those things are applied. In other words, it is a workflow.

This is where the software-factory metaphor becomes useful. Eric Zakariasson talks directly about “building your own software factory,” and the phrase lands because it redirects attention from one-off generation to staged production. A factory has specifications, stations, checks, feedback loops, and manager-visible status. It does not assume that every worker can safely improvise in every direction.
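One way to picture staged production is as an ordered list of stations, each with a check, where work stops at the first failed check and the status log stays visible. This is a sketch under that assumption; `run_harness` and the stage names in the usage note are hypothetical, not an API from any tool named in this chapter.

```python
from typing import Callable

# A stage is a (name, check) pair; the check inspects the task and
# returns True when the work may proceed to the next station.
Stage = tuple[str, Callable[[dict], bool]]


def run_harness(task: dict, stages: list[Stage]) -> list[tuple[str, bool]]:
    """Run stages in order; stop at the first failure and report status."""
    log: list[tuple[str, bool]] = []
    for name, check in stages:
        ok = check(task)
        log.append((name, ok))
        if not ok:
            break
    return log
```

With stages such as `("setup", ...)`, `("lint", ...)`, and `("tests", ...)`, the returned log is the manager-visible status: it shows which station the work reached and where it stopped.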

Zakariasson makes a subtler point too: the factory itself needs a spec. “To set the spec for the factory,” he says, you would likely have a folder in the codebase with markdown guidance, best practices, and rules. That is exactly the conceptual move this chapter needs. The harness has to be encoded somewhere. Once it is checked into the repo, process stops living only in human habit and becomes part of the codebase’s working surface.

This is also where the book’s middle spine starts locking together. Once you think in harnesses, you immediately care about evaluation and runtime semantics. A harness that cannot measure quality is incomplete. A harness that cannot preserve state across longer tasks is fragile. Chapter 3 does not end the argument. It hands the book directly into Chapter 4.

Subagents and specialization belong to the harness

The same logic extends beyond a single agent.

One of the most interesting developments in modern coding systems is the move toward specialized roles: research agents, review agents, refactor agents, debugging agents, and broader subagent frameworks that let a larger task be decomposed into parallel, semi-independent work. OpenAI has demonstrated the pattern directly: in one workshop, Codex spun up parallel sub-agents to review a directory of files at once, each sub-agent assigned its own model, sandbox mode, and tool access, with an orchestrator partitioning the work and collating the results.
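The partition-and-collate shape described above can be sketched with the standard library. This is not Codex's mechanism: `review_chunk` is a stand-in for a sub-agent that would really call a model inside a sandbox, and its file-extension rule is a placeholder so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items: list[str], n: int) -> list[list[str]]:
    """Split items into at most n round-robin chunks, dropping empty ones."""
    chunks = [items[i::n] for i in range(n)]
    return [c for c in chunks if c]


def review_chunk(files: list[str]) -> dict[str, str]:
    """Stand-in for a sub-agent; a real one would call a model in a sandbox."""
    return {f: "needs-tests" if f.endswith(".py") else "ok" for f in files}


def orchestrate(files: list[str], workers: int = 3) -> dict[str, str]:
    """Fan work out to parallel reviewers, then collate into one report."""
    report: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(review_chunk, partition(files, workers)):
            report.update(partial)
    return report
```

The collation step is the part worth noticing: the orchestrator returns one report a human can inspect, rather than leaving each worker's output scattered.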

The key insight is not merely that parallelism can make things faster. It is that specialization makes process explicit.

When a team creates a dedicated review agent, a repo-auditing agent, or a migration-focused agent with narrow tools and instructions, it is encoding judgment about how work should be done. The role itself becomes part of the harness. This mirrors what strong human organizations already do. They do not treat every task as a blank slate executed by a generalist. They create roles, review structures, and bounded responsibilities so judgment can scale.

But subagents also intensify the need for good scaffolding. More workers without a stronger harness do not create a factory. They create chaos faster. The mistake to watch for: adding parallel agents before you can recompose, inspect, and evaluate their output. At that point you have multiplied generation without multiplying the checks, and the throughput is illusory. The rule of thumb is to add a subagent role only once the recomposition and review surface for its output already exists. That means subagents are not an argument against harness engineering. They are evidence that harness engineering is becoming more important.

The new advantage is environment design

The marketing of AI coding tools naturally focuses on generation. That is the visible magic. The agent edits a file. It writes a test. It proposes a patch. Those moments are real and often impressive.

But the durable advantage is increasingly elsewhere.

It belongs to teams that make their repositories legible. Teams that externalize non-functional judgment instead of leaving it trapped in senior engineers’ heads. Teams that treat specs as reusable intent rather than ceremonial paperwork. Teams that invest in validations and repo affordances that help an agent check its own work. Teams that gradually turn loose process into a staged, inspectable software factory.

This is why harness engineering deserves to be treated as a primary discipline instead of a tactical trick. The harness is not a helper around the codebase. It is becoming part of the codebase.

The winners in AI coding will not simply be the teams with access to the strongest models. They will be the teams that built workplaces those models can actually understand.

That is the bridge into the next chapter. Once the environment can produce delegated work at all, the obvious next question is no longer how to generate more. It is how to know whether the generated work is actually good.

What to do with this

  • When an agent disappoints, invert the blame before swapping models: ask whether your repository, specs, validations, and workflow are structured well enough for a machine collaborator to operate without constant rescue. Treat the unexplained "sloppy patch" as a missing success criterion, not a stupid model.
  • Run the agent-readiness checklist on one repo this week: a stable folder structure, explicit setup/build/run commands, strong type/lint/test gates the agent can run repeatedly, architecture decisions stored in files, examples of accepted patterns for tests/APIs/migrations/reviews, and specs stored close enough to the work to survive handoff.
  • Stop relying on "everyone kind of knows how we do migrations here." Externalize the tacit standards: check in examples of accepted patterns, write down non-functional constraints, and replace Slack archaeology with setup scripts the agent can read.
  • Gate spec-writing by cost of misunderstanding, not by habit. For small or exploratory work, prompt and iterate; once a task is large, parallelized, safety-sensitive, or expensive to review, write the spec first — with constraints and non-functional requirements — so it persists across retries and handoffs.
  • Apply the survival test to intent: if a fresh agent picking up the task cold would lose a piece of intent because context expired or work was decomposed, move that intent out of chat and into a persistent artifact.
  • Encode the factory's own rules into the repo. Create a folder of markdown guidance, best practices, and rules so process stops living only in human habit, and add a specialized subagent role only once the surface to recompose and review its output already exists.
EVIDENCE OF SOURCE · CHAPTER 03 · VIDEOS

10 claims · 33 source anchors

Evidence — Source Anchors

Reliability comes less from model cleverness than from surrounding scaffolding

  • The important thing is not the code but the prompt and the guardrails that got you there.
    #16 · Ryan Lopopolo, OpenAI · confidence: high
  • Agents have intelligence and capabilities, but not always expertise that we need for real work.
    #83 · Barry Zhang & Mahesh Murag, Anthropic · confidence: high
  • these are three kind of like ingredients which are pretty simple and pretty basic, but I think provide an interesting kind of like first principles approach for how to think about
    #198 · Harrison Chase, LangChain/LangGraph · confidence: high

Harness quality is a major determinant of coding-agent quality

  • a good harness is really operationalized around giving the model text at the right time
    #16 · Ryan Lopopolo, OpenAI · confidence: high
  • there's so much work that has been put in uh over the last you know 20 to 30 years around the automated validation and verification of software that you build
    #57 · Eno Reyes, Factory AI · confidence: high
  • instead of micromanaging, what I'm doing is I'm scaffolding and providing context.
    #190 · Eric Hou, Augment Code · confidence: high
  • identifying problems with the code because if there's no problems then it's probably high quality code
    #179 · Josh Albrecht, Imbue · confidence: high

Specs are not paperwork; they are executable intent

  • specs are natural language, you're using specs as a control surface to explain what you want the system to do.
    #40 · Al Harris, Amazon Kiro · confidence: high
  • leaving breadcrumbs, documentation, ADRs, persona oriented documentation around what a good job looks like.
    #16 · Ryan Lopopolo, OpenAI · confidence: high

The practical unit of AI coding is the codebase, not the snippet

  • snippets and my last project was generating an entire codebase.
    #72 · Naman Jain, Cursor · confidence: high
  • agents MD files an open standard
    #57 · Eno Reyes, Factory AI · confidence: high
  • codebase for harness engineering
    #16 · Ryan Lopopolo, OpenAI · confidence: high

Agent-ready codebases are designed, not discovered

  • agents MD files an open standard
    #57 · Eno Reyes, Factory AI · confidence: high
  • context deficit as the biggest blocker.
    #190 · Eric Hou, Augment Code · confidence: high
  • a garbage codebase you're going to get
    #621 · Matt Pocock · confidence: high

The harness is evolving from a local loop into a staged software factory

  • getting to a place where you can build your own like software factory
    #629 · Eric Zakariasson, Cursor · confidence: high
  • unified agent harness that will manage
    #632 · Vaibhav Srivastav & Katia Gil Guzman, OpenAI · confidence: high
  • parallel agents working together to fix
    #42 · Robert Brennan, OpenHands · confidence: high
  • The difference with missions is that we run features serially.
    #653 · Luke Alvoeiro, Factory · confidence: high
  • Our longest mission ran for 16 days
    #653 · Luke Alvoeiro, Factory · confidence: high
  • We just kind of gave each role its own kind of context window.
    #691 · Ash Prabaker & Andrew Wilson, Anthropic · confidence: high
  • it's no longer about the model or the agent. It's about the process.
    #743 · Vincent Koc, OpenClaw · confidence: high

Harness quality now includes capability packaging, not only repo hygiene

  • That's what a skill is. You're teaching the the LLM how to do something in the way that you expect it to be done
    #654 · Nick Nisi & Zack Proser, WorkOS · confidence: high
  • This is how the agent is
    #683 · Pedro Rodrigues, Supabase · confidence: high
  • 49% reduction of the initial
    #625 · Sam Morrow, GitHub · confidence: high
  • the schema is the UI for the agent.
    #744 · Michael Hablich, Google (Chrome DevTools) · confidence: high

Coordination is the unsolved runtime primitive for multi-agent systems

  • the thing that's missing for me is coordination.
    #704 · Lou Bichard, Ona · confidence: high
  • through sort of state machines, you know, by building out workflows and effectively state machines
    #704 · Lou Bichard, Ona · confidence: high
  • They step on each other's changes. They duplicate work. They make inconsistent architectural decisions.
    #653 · Luke Alvoeiro, Factory · confidence: high
  • we have the two agents basically negotiate what done actually means.
    #691 · Ash Prabaker & Andrew Wilson, Anthropic · confidence: high

Coding agents expose the gap between standards a team possesses and standards it can operationalize

  • there's so much work that has been put in uh over the last you know 20 to 30 years around the automated validation and verification of software that you build
    #57 · Eno Reyes, Factory AI · confidence: high
  • specs are natural language, you're using specs as a control surface to explain what you want the system to do.
    #40 · Al Harris, Amazon Kiro · confidence: high

Subagent specialization makes process explicit and encodes team judgment into roles

  • a good harness is really operationalized around giving the model text at the right time
    #16 · Ryan Lopopolo, OpenAI · confidence: high

FIG. 04.0 · Opener: Vibes & benchmarks vs operational eval loop

Chapter 4 — Evals Are the Control System

The obvious failure mode of AI is that it can be wrong. The more dangerous one is that it can look right often enough that the team stops measuring.

A demo works twice. A prototype feels sharp. A coding agent lands a decent patch. A support assistant answers a handful of questions convincingly, and everyone starts speaking in the language of vibes — the system feels promising, maybe even close to ready. That is exactly where the trouble begins, because a promising feeling is not a control loop. Production trust comes from the ability to compare versions, catch regressions, preserve hard-won lessons, and measure whether the system still works when real users, real data, and real edge cases arrive. That is what evals are for, and this chapter argues they are not a side practice you bolt on before launch. They are the operating system of a system you intend to trust.

This is the same argument Chapter 2 made about review, now mechanized. There, judgment under abundance was a human posture; here it becomes an instrument. Once a system does delegated work, you cannot eyeball every output, and the discipline that kept cheap generation honest has to become something you can run.

The unit of evaluation changed

Figure 04.1 / Evals are not unit tests

For most of the short history of AI evaluation, the unit was small: a single completion, a one-line answer, a snippet judged in isolation. That worked while the systems themselves were small. It stops working the moment the system's job grows.

Naman Jain, who builds at Cursor, names the shift in his own work: "coding capabilities have leapt from generating one-line snippets to completing entire codebases with agentic workflows." When the deliverable was a snippet, you could grade the snippet. When the deliverable is a multi-file change across a real repository, grading the diff line by line tells you almost nothing about whether the system did the job. The unit of evaluation has to grow to match the unit of work. A codebase change, a multi-step workflow, a retrieval-heavy research task — each has to be judged at the level it operates, not at the level that happens to be easy to score.

This is why the chapter's title insists on control system rather than test suite. A test suite checks fixed assertions about small units. A control system steers a large, drifting process toward a goal over time. The distinction is not pedantic: the choice of unit determines what you build, what you measure, and whether your measurement survives contact with a system that does real work.

Evals are not unit tests — and also are

Figure 04.2 / The unit of evaluation

Ido Pesok, working on evals at Vercel's v0, gave a talk with a deliberately blunt title: "Evals are not unit tests." His point is a reasoning posture. A unit test encodes a binary fact — the function returns 4, or it is broken. An eval rarely encodes a fact like that. It encodes a judgment about quality, helpfulness, safety, or fit, and treating a pass/fail eval score as if it were a unit test's green checkmark quietly smuggles in a certainty the measurement does not have. Application-layer evals are messy because reality is messy: users, latency, cost, policy, and workflow constraints all bear on whether an output was actually good, and none of them reduce cleanly to true or false.

It is worth holding that against a practitioner who says the opposite. Lawrence Jones at incident.io calls them, flatly, "AI unit tests" — and stores them as YAML files next to the prompts they grade. The disagreement looks sharp and is actually productive, because the two are describing different surfaces of the same artifact. Pesok is describing the human reasoning stance: do not mistake a quality score for a correctness proof. Jones is describing the agent-facing interface: what the eval looks like to a coding agent that just needs to add or modify a case. Both are right. An eval is not a unit test in its epistemology and is very much like one in its ergonomics, and a chapter that flattens that tension loses something true about how the discipline actually works.
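Both stances can be seen in one small sketch. The cases are plain data an agent could add to or edit, which is the unit-test-like ergonomics, while the result is a graded score compared against a threshold, which is a judgment and not a proof. The names `keyword_grader` and `run_suite`, the case shape, and the keyword-matching grader are all assumptions for illustration; this is not incident.io's or Vercel's format.

```python
def keyword_grader(output: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords present: a graded score, not a proof."""
    if not expected_keywords:
        return 1.0
    hits = sum(1 for k in expected_keywords if k.lower() in output.lower())
    return hits / len(expected_keywords)


def run_suite(cases: list[dict], system, threshold: float = 0.75) -> dict:
    """Run each case through `system` and aggregate scores against a threshold."""
    scores = [keyword_grader(system(c["input"]), c["expect"]) for c in cases]
    avg = sum(scores) / len(scores)
    return {"scores": scores, "mean": avg, "passed": avg >= threshold}
```

The `passed` flag is a policy decision layered on top of the scores. Changing the threshold changes the verdict without changing anything about the system, which is exactly the certainty gap the paragraph warns about.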

Real tasks beat synthetic cleverness

Figure 04.3 / The observability flywheel

If evals encode judgment rather than facts, the question becomes where the judgment comes from — and the strongest answer in the corpus is that it is mined, not invented.

The best evaluation sets are rarely written from scratch in a clean room. They are drawn from operational history: the failed support conversations, the difficult research tasks, the painful coding regressions, the edge cases that triggered an escalation. What hurt you in production is far more informative than what looked clever in a benchmark, because it is real, specific, and already known to matter. Jain's team gives a concrete recipe to copy: take a real codebase, crawl its commit history, find the commits that fixed actual problems, and turn each fix into a graded task the agent has to reproduce — revert the fix, hand the agent the broken state, and score whether it gets back to the known-good commit. The escalation logs, incident tickets, and bug-fix commits you already have are an unmined eval set; the work of authoring it has mostly been done for you by the failures themselves. The eval is not a synthetic puzzle. It is a re-run of work that happened.
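The commit-mining recipe can be sketched over already-extracted history. Two simplifications are assumptions of this example, not part of the recipe as described: a keyword heuristic (`FIX_PATTERN`) stands in for however a team identifies real fixes, and the fix's parent commit is used as the "broken state" instead of an explicit revert. The input shape (`sha`, `parent`, `message`) is hypothetical.

```python
import re
from dataclasses import dataclass

# Assumption: a keyword heuristic stands in for however a team labels fixes.
FIX_PATTERN = re.compile(r"\b(fix|fixes|fixed|bug|regression|hotfix)\b", re.I)


@dataclass(frozen=True)
class EvalTask:
    broken_commit: str   # state handed to the agent (parent of the fix)
    target_commit: str   # known-good commit the agent should reproduce
    description: str


def mine_fix_commits(history: list[dict]) -> list[EvalTask]:
    """Turn fix commits into tasks; skips root commits that have no parent."""
    tasks = []
    for c in history:
        if c.get("parent") and FIX_PATTERN.search(c["message"]):
            tasks.append(EvalTask(c["parent"], c["sha"], c["message"]))
    return tasks
```

Scoring is left out on purpose: whether the agent "gets back to the known-good commit" might mean passing the fix's tests or matching behavior, and that choice belongs to the team.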

Samuel Colvin at Pydantic adds the discipline that keeps this honest under the pressure of the GenAI era. "We still want to build reliable, scalable applications," he notes, "and that is still hard — arguably harder with Gen AI than it was before." Human-seeded evals — examples a knowledgeable person labeled because they encode a real failure mode — are unusually valuable precisely because they carry that hard-won knowledge into a form the system can be tested against repeatedly. The seeding is the point. A human who has seen the system fail in a particular way writes the case that catches that failure forever after.

The cost is real: natural tasks are harder to score and harder to maintain than toy benchmarks. But treat that difficulty as a signal, not a deterrent — it is the trap of synthetic benchmarks that they stay cheap to score precisely because they have stopped resembling the work. The decision rule follows: when a benchmark is easy to grade and your eval set is passing comfortably, suspect that you are measuring the convenient unit rather than the real one, and go mine the next painful production failure instead. The more the system does genuine work, the less a synthetic eval can tell you about it.

Observability and evals are the same problem

Here the chapter's argument turns from offline measurement to the live system, and the cleanest statement of the turn comes from Phil Hetzel at Braintrust: "Observability and eval, to us, are actually the same problem from a systems perspective." That sentence is worth sitting with, because it dissolves a distinction most teams treat as fixed.

The usual mental model keeps them apart: evals are the offline thing you run before shipping, observability is the production thing you watch after. Hetzel's thesis is that they are one loop. Production traces are not merely debugging artifacts you inspect when something breaks. They are the raw material for tomorrow's regression set. Every real interaction the system has (every success, every failure, every weird edge case a user actually hit) is a candidate eval case, and the strongest teams close that loop deliberately: traces feed failure analysis, failure analysis feeds the eval set, the eval set steers the next version, and the next version is watched in production again. Observability is not downstream of evals. It is where the next generation of evals is born.
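One step of that loop, promoting production traces into eval cases, can be sketched as a small filter. The trace shape (`id`, `input`, `outcome`) and the choice to promote only failures are assumptions for illustration; a real pipeline would usually add a human or automated triage step before a trace becomes a case.

```python
def traces_to_cases(traces: list[dict], existing_inputs: set[str]) -> list[dict]:
    """Promote failed production traces to eval cases, skipping known inputs."""
    cases = []
    seen = set(existing_inputs)
    for t in traces:
        if t["outcome"] == "failure" and t["input"] not in seen:
            seen.add(t["input"])
            cases.append({"input": t["input"], "source_trace": t["id"]})
    return cases
```

Keeping `source_trace` on each case preserves the link back to the production event, so a later reader can see why the case exists.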

This is also why Hetzel insists that "an eval platform is not just a test runner." A test runner executes assertions and reports pass/fail. An eval platform has to hold datasets, persist results across versions, support comparison workflows, render traces beside scores, and produce scoring credible enough to act on. Treat that list as a buy-or-build checklist: if your tool cannot persist results across versions and put a trace next to its score, it is a test runner wearing an eval platform's name, and it will not catch drift. The infrastructure is not incidental to the discipline. It is the discipline, made operable. A team that treats evals as a script they run by hand will measure once, feel reassured, and miss the drift that the loop was supposed to catch.
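Persisting results across versions matters because drift only shows up in a comparison. A minimal sketch, assuming each version's results are stored as a mapping from case ID to score; the function name and the `tolerance` default are hypothetical.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, tuple[float, float]]:
    """Return cases whose score dropped by more than `tolerance` between versions."""
    return {
        case: (old, candidate[case])
        for case, old in baseline.items()
        if case in candidate and old - candidate[case] > tolerance
    }
```

Cases present in only one version are ignored here; a fuller tool would report those separately, since a silently dropped case can hide a regression.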

When the agents read the evals too

There is a final turn that the rest of this book has been setting up, and the corpus has exactly one strongly argued account of it. Once coding agents are doing most of the implementation work, the eval system stops being something only humans read. It becomes an artifact the agents themselves must navigate and modify.

Lawrence Jones at incident.io describes building this the hard way. The team stored evals as YAML next to their Go prompt files, and then watched the natural instinct — wrap the evals in richer and richer browser UIs — fail twice over. Humans liked the dashboards but did not have time to use them, and the coding agents could not navigate them at all: "coding agents weren't able to work with them." The unlock was not a better UI. It was a small CLI — "a small CLI tool that we call eval tool, designed to allow agents to leverage our eval suite files." The eval suite became an interface an agent fleet could plug into, rather than a destination a human had to visit.

The same inversion solved their observability problem. incident.io had built rich web UIs to debug AI traces; the agents, again, could not use them. So instead of wrapping the trace database in a fancier front end, they dumped the whole thing as a file tree — because, as Jones puts it, "file systems are exceptionally good agent context." Then they pushed it further: their "scrapbook" pipeline downloads every backtest investigation as a file system and runs roughly twenty-five agents in parallel, one per investigation, clustering the analyses into cohort patterns. The output is not a number on a dashboard. It is a structured improvement report — agents evaluating agent output, with the human receiving a diff instead of a chart. Jones is careful to generalize: "these patterns do generalize" beyond incident response.
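The file-tree idea is easy to sketch with the standard library. The layout below (`<root>/<investigation>/<trace id>.json`) and the trace fields are assumptions for illustration, not incident.io's actual format. The point is that an agent can then explore traces by listing and reading files, with no UI needed.

```python
import json
from pathlib import Path


def dump_traces(traces: list[dict], root: Path) -> list[Path]:
    """Write each trace to <root>/<investigation>/<trace id>.json."""
    written = []
    for t in traces:
        folder = root / t["investigation"]
        folder.mkdir(parents=True, exist_ok=True)
        path = folder / f"{t['id']}.json"
        path.write_text(json.dumps(t, indent=2), encoding="utf-8")
        written.append(path)
    return written
```

One directory per investigation also gives a natural unit for fanning out parallel agents, one per folder.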

The chapter holds this conservatively — it rests on one talk, however well argued — but it is the natural endpoint of the control-system frame. When the thing being steered can also read the steering instrument, the eval system becomes part of the agent's own loop. The measurement and the work begin to share a substrate.

Why this is the operating system of trust

Pull the threads together and the chapter's argument resolves. Evals are not a quality-assurance ritual performed before launch and forgotten. They are how a delegated system earns and keeps trust over time. They externalize judgment, turning fuzzy standards like good, safe, and useful into examples, rubrics, and thresholds a system can be held to. They scale to the real unit of work instead of the convenient one. They draw their cases from what actually hurt rather than what looked clever. They close a loop with production rather than running once in a lab. And, increasingly, they become an interface the agents themselves participate in.

This connects directly forward. Chapter 5 will argue that context is the infrastructure determining what the model can even see — and an eval is how you find out whether your context assembly earns its tokens. Chapter 6 will argue that runtimes carry the work across time — and an eval is how you know the long-running system still behaves after the tenth resume. Chapter 9 will make eval and review capacity the throughput limit of an entire organization. In every case the eval is the instrument that converts a hope about the system into evidence about it.

The book's recurring claim is that reliability comes from the scaffolding around the model, not from the model's cleverness. Evals are the part of that scaffolding that tells you whether the rest of it is working. Without them, every other discipline in this book is a guess you have decided to believe. With them, it becomes something you can measure, steer, and trust.

What to do with this

  • Match the eval's unit to the unit of work. If the deliverable is a multi-file change or a multi-step workflow, stop grading the diff line by line and score the completed task at the level it operates — grading a snippet that no longer exists tells you almost nothing.
  • Mine your operational history instead of authoring from scratch. Crawl your commit history for fixes, revert each one, hand the agent the broken state, and score whether it reaches the known-good commit — the way Jain's team builds at Cursor. Your escalation logs, incident tickets, and bug-fix commits are an eval set the failures already wrote for you.
  • Seed evals from real failures, not clever puzzles. When someone who knows the system watches it fail in a particular way, capture that case so it catches that failure forever after — that human-seeded knowledge is the part a synthetic benchmark cannot give you.
  • Treat a comfortably-passing synthetic benchmark as a warning. When the set is cheap to grade and passing easily, suspect you are measuring the convenient unit, and go mine the next painful production failure instead.
  • Close the loop between observability and evals. Treat production traces as the raw material for tomorrow's regression set: route traces into failure analysis, failure analysis into the eval set, the eval set into the next version, then watch that version in production again.
  • Audit your tooling against the platform bar. If your eval tool cannot persist results across versions and render a trace beside its score, it is a test runner — it will let you measure once and miss the drift. Also check that an agent, not just a human dashboard, can read and modify the eval suite: a small CLI over a file tree beats a rich browser UI agents cannot navigate.
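The commit-mining step in the list above can be sketched as follows. This is an illustration under assumptions, not the pipeline any of the cited teams runs: the `EvalTask` type, the keyword heuristic for spotting fix commits, and the commit-record shape (`sha`, `parent`, `message`, for example parsed from `git log`) are all hypothetical. Starting the agent at the parent of a fix commit stands in for reverting the fix.

```python
import re
from dataclasses import dataclass

# Assumed heuristic: a commit message mentioning a fix marks a fix commit.
FIX_PATTERN = re.compile(r"\b(fix|fixes|fixed|hotfix)\b", re.IGNORECASE)


@dataclass(frozen=True)
class EvalTask:
    start_commit: str   # broken state handed to the agent (parent of the fix)
    target_commit: str  # known-good state the agent should reach
    description: str


def tasks_from_history(commits):
    """Turn fix commits into graded tasks."""
    return [
        EvalTask(c["parent"], c["sha"], c["message"])
        for c in commits
        if FIX_PATTERN.search(c["message"])
    ]


def score(task, tests_pass_at):
    """Score 1.0 if the agent's result passes the checks for this task."""
    return 1.0 if tests_pass_at(task) else 0.0


history = [
    {"sha": "c3", "parent": "c2", "message": "Fix null deref in invoice export"},
    {"sha": "c2", "parent": "c1", "message": "Add invoice export"},
]
tasks = tasks_from_history(history)
```

The `tests_pass_at` callable is where a real harness would check out the agent's result and run the tests that shipped with the original fix.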
EVIDENCE OF SOURCE · CHAPTER 04 · VIDEOS

10 claims · 38 source anchors

Evidence — Source Anchors

Reliability comes less from model cleverness than from surrounding scaffolding

  • The important thing is not the code but the prompt and the guardrails that got you there.
    #16 — Ryan Lopopolo, OpenAI · confidence: high
  • Agents have intelligence and capabilities, but not always expertise that we need for real work.
    #83 — Barry Zhang & Mahesh Murag, Anthropic · confidence: high
  • these are three kind of like ingredients which are pretty simple and pretty basic, but I think provide an interesting kind of like first principles approach for how to think about
    #198 — Harrison Chase, LangChain/LangGraph · confidence: high

Harness quality is a major determinant of coding-agent quality

  • a good harness is really operationalized around giving the model text at the right time
    #16 — Ryan Lopopolo, OpenAI · confidence: high
  • there's so much work that has been put in uh over the last you know 20 to 30 years around the automated validation and verification of software that you build
    #57 — Eno Reyes, Factory AI · confidence: high
  • instead of micromanaging, what I'm doing is I'm scaffolding and providing context.
    #190 — Eric Hou, Augment Code · confidence: high
  • identifying problems with the code because if there's no problems then it's probably high quality code
    #179 — Josh Albrecht, Imbue · confidence: high

Specs are not paperwork; they are executable intent

  • specs are natural language, you're using specs as a control surface to explain what you want the system to do.
    #40 — Al Harris, Amazon Kiro · confidence: high
  • leaving breadcrumbs, documentation, ADRs, persona oriented documentation around what a good job looks like.
    #16 — Ryan Lopopolo, OpenAI · confidence: high

The practical unit of AI coding is the codebase, not the snippet

  • snippets and my last project was generating an entire codebase.
    #72 — Naman Jain, Cursor · confidence: high
  • agents MD files an open standard
    #57 — Eno Reyes, Factory AI · confidence: high
  • codebase for harness engineering
    #16 — Ryan Lopopolo, OpenAI · confidence: high

Evals are a control system, not just a test suite

  • improvement without measurement is limited and imprecise.
    #125 — Ido Pesok, Vercel v0 · confidence: high
  • We still want to build reliable scalable applications and that is still hard
    #184 — Samuel Colvin, Pydantic · confidence: high
  • eval to us it's actually the same problem from a from a systems perspective.
    #628 — Phil Hetzel, Braintrust · confidence: high
  • small CLI tool that we call eval tool
    #689 — Lawrence Jones, incident.io · confidence: high
  • designed to allow agents to leverage our eval suite files.
    #689 — Lawrence Jones, incident.io · confidence: high
  • classic benchmark maxing.
    #746 — Ara Khan, Cline · confidence: high
  • There are right ways to use them. There are wrong ways to use them.
    #746 — Ara Khan, Cline · confidence: high

Realistic evals must be grounded in natural tasks and operational history

  • task should be natural and sourced from the real world and then you should be able to reliably grade them.
    #72 — Naman Jain, Cursor · confidence: high
  • If you build your application in a type safe way, if you use frameworks that allow it to be type safe, you can refactor it with confidence much more quickly.
    #184 — Samuel Colvin, Pydantic · confidence: high
  • Dynamic data sets have real world alignment.
    #153 — Quotient AI + Tavily · confidence: high

Evals are strongest when they are trace-linked and fed by production observability

  • what is the gap between agent observability and what you're actually building. How do we mind that gap?
    #680 — Amy Boyd & Nitya Narasimhan, Microsoft · confidence: high
  • we go from like a testing and eval paradigm to a monitoring p uh paradigm.
    #655 — Danny Gollapalli & Ben Hylak, Raindrop · confidence: high
  • where I've got some big production CI stack to go and run and deployment takes hours, being able to go and change variables in production or in staging very quickly
    #657 — Samuel Colvin, Pydantic · confidence: high
  • download all of the UI that we have as a file system?
    #689 — Lawrence Jones, incident.io · confidence: high
  • 25 agents in parallel
    #689 — Lawrence Jones, incident.io · confidence: high
  • it's actually the telemetry that does that.
    #750 — Dat Ngo, Arize · confidence: high

Activity-based metrics misread motion as progress in AI-augmented work

  • these are not productivity metrics. They're useful, but you cannot just kind of use them like maximize them to maximize developer productivity.
    #79 — Yegor Denisov-Blanch, Stanford (120k devs study) · confidence: high
  • I do think that AI increases developer productivity, but there's also cases in which it decreases developer productivity.
    #195 — Yegor Denisov-Blanch, Stanford (100k devs study) · confidence: high
  • just plain old PR throughput. How many pull requests does the average engineer merge per week?
    #101 — Nick Arcolano, Jellyfish (20M PRs) · confidence: high
  • I'm going to talk about how we pay engineers. And we pay engineers like salespeople.
    #63 — Arman Hezarkhani, Tenex · confidence: high

Problem framing and review become the scarce skills once execution is cheap

  • the new scarce skill is writing specifications that fully capture the intent
    #265 — Sean Grove, OpenAI · confidence: high
  • intentionally designed to put friction
    #14 — Armin Ronacher & Cristina Poncela Cubeiro · confidence: high
  • vibes aren't going to fix
    #132 — Chris Kelly, Augment Code · confidence: high
  • I'm declaring war on slop today.
    #59 — swyx · confidence: high

The best evals encode judgment mined from operational history, not invented in a clean room

  • take a real codebase, crawl its commit history, find the commits that fixed actual problems, and turn each fix into a graded task
    #60 — Govind Jain, Stripe · confidence: high
  • handle state potentially over long periods of time. There needs to be human interaction for approvals
    #167 — Preeti Somal, Temporal · confidence: high

Figure 05.0/Stuffing the window vs assembling context

Chapter 5 — Context Is Infrastructure

Useful AI systems do not fail only because the model is weak. They fail because the system cannot assemble the right working set of information at the right moment, in the right shape, at a cost the product can bear.

For a while, context looked like a prompt-trick problem. You had a box, a token limit, and a growing collection of devices for stuffing more things into it. Add a few retrieved documents. Paste the spec. Prepend some examples. Tell the model to think harder. Each move felt like progress because each one was visible: more characters in, more confidence out.

But the framing was upside down.

Context is not the garnish around intelligence. It is the substrate that determines what the system can even notice. Two systems with identical model weights can differ enormously in usefulness based purely on what reaches the model and what shape that information arrives in. The chapter that follows is about treating that substrate the way engineering treats other substrates: as infrastructure, with versions, budgets, observability, and failure modes you have a plan for.

Context is the substrate, not the garnish

Figure 05.1/RAG is not memory

The earliest framings of prompt engineering trained a generation of builders to think of context as text you assemble in a string. That worked when the system was a chatbot answering one question and the right context could fit in a single window. It stops working the moment the system has to act over real workflows, with real users, against real data, for more than a few turns.

Val Bercovici names the shift directly. Context platform engineering, he calls it, is "the set of skills and tools to design, size, and configure systems optimized for agent swarm context, at any scale." The phrase is dense, but the move is clear. The thing being engineered is no longer the prompt. It is the platform that decides what gets put into prompts.

Ofer Mendelevitch's enterprise deep-research framing pushes the argument to its limit. The hard problem of enterprise AI, he says, is not access to documents. It is access to the relevant documents — the ones the agent actually needs for the current step, ranked, deduplicated, and trustworthy. Once you accept that frame, prompt engineering becomes a small subset of a much larger discipline.

Stuffing context is not memory

Figure 05.2/GraphRAG connects the dots

Jack Morris has one of the sharpest one-liners in the corpus: "Stuffing context is not memory."

The line is so quotable it can be mistaken for a slogan. It is actually a load-bearing claim about architecture.

A system that places relevant documents into a prompt has performed retrieval; it has not remembered anything. The model has no commitment to those documents after the response is generated. Nothing persists. Nothing accumulates. The next turn starts from the same blank slate, and any continuity the user experiences is an illusion, manufactured by re-stuffing the window with carefully chosen artifacts.

Memory, by contrast, requires durable structure. It requires deciding what to retain, what to summarize, what to forget, what to refresh, and what to surface unprompted. It requires a model of the entities the system has encountered and the relationships between them. It requires a way to update old beliefs when new evidence arrives. None of those properties emerge from pushing more text into a window.

The reason this distinction matters is that the failures look identical from the outside. A system that retrieves badly and a system that forgets can both surface a wrong answer about a customer the agent talked to yesterday. The difference shows up in the fix. Better retrieval can patch the first; only better memory architecture can patch the second.

Daniel Chalef at Zep makes a related point with a more pointed framing: stop using RAG as memory. The error he sees in production systems is not RAG itself but RAG carrying weight it was never designed to carry — long-term user state, evolving entity facts, cross-session continuity. RAG is good at fetching documents. It is bad at maintaining a model of the user across months.

Read these two together as a selection rule. Reach for retrieval when the job is fetching the right documents for the current step; reach for a memory layer when the job is maintaining state that has to persist — long-term user facts, evolving entities, cross-session continuity. Chalef's warning marks the trap: the moment RAG starts carrying that durable state, you have collapsed two layers that need to be designed separately.

RAG, memory, and GraphRAG are different jobs

Figure 05.3/The MCP tool flood

Once the distinction between retrieval and memory is on the table, the next obvious question is whether all retrieval is the same. It is not.

Stephen Chin's work and the broader Neo4j framing argue that retrieving over a graph is a different operation than retrieving over a flat vector index. The vector index excels when the question is about semantic similarity to existing content. The graph excels when the question is about relationships — who depends on whom, which entities co-occur, what does this concept connect to. Mitesh Patel at NVIDIA pushes the framing further with hybrid RAG, which fuses graph and vector retrieval because most useful enterprise questions need both.
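One common, generic way to fuse a vector ranking with a graph ranking is reciprocal rank fusion (RRF). The sketch below uses it purely as an illustration of hybrid retrieval; it is a swapped-in technique, not a description of the method Patel or the Neo4j team presents. The document ids and the constant `k=60` (a conventional default for RRF) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic-similarity order
graph_hits = ["doc_c", "doc_a", "doc_d"]   # relationship-traversal order
fused = reciprocal_rank_fusion([vector_hits, graph_hits])
```

Documents that both retrievers surface rise to the top, while a document only one retriever finds still stays in the candidate set.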

David Karam, from his Pi Labs work after Google Search, frames retrieval as a layered problem. You don't pick one technique; you layer them, one query at a time, with each layer handling a class of failure the others miss. Retrieval is not a black box. It is a system of techniques with known trade-offs, and the engineering job is knowing which technique to apply where.

The deeper point is that the field's vocabulary has been lagging the architecture. Practitioners say "RAG" and mean four different things depending on context. Sometimes they mean a single embedding lookup before a single prompt. Sometimes they mean a retrieve-then-rerank pipeline. Sometimes they mean a graph traversal that surfaces entities. Sometimes they mean a long-term memory layer. The label has collapsed distinctions that the architecture depends on.

Will Bryk's neural-RAG work at Exa comes at it from the opposite direction. Instead of treating retrieval as something that happens before reasoning, it integrates retrieval into the reasoning loop. The model decides what to search for, what to follow up on, and when to stop. That is also RAG by current usage, but it has almost nothing in common with the embedding-lookup pattern that the term originally described.

The practical consequence is that "we'll just add RAG" is no longer a useful sentence in a design discussion. The question worth asking is which retrieval pattern, against which data shape, with what ranking, with what freshness guarantees, feeding into what reasoning surface. Once those questions are explicit, the architecture becomes engineerable. Until they are, the system relies on luck.

Enterprise usefulness is a working-set problem

Figure 05.4/Misassembly is not hallucination

The argument so far has been mostly about technique. The harder argument is about value.

Joel Hron's framing — that AI is shifting from helpfulness to producing judgments — looks different inside an enterprise than it does inside a consumer chatbot. Inside the enterprise, the agent's value depends almost entirely on its ability to assemble a small, accurate, current, trustworthy working set out of a much larger, messier corpus. Without that working set, the model is doing impressive cognition over the wrong material, and the answer is wrong in a way that is hard to catch.

Mendelevitch's enterprise deep-research framing is built around this exact failure mode. The corpus is huge, the user query is broad, and the system has to converge on the few hundred passages that actually matter for the question at hand. The work that produces value is the convergence, not the cognition that happens after.

Calvin Qi at Harvey and Chang She at Lance describe the same pattern from inside legal work. Lawyers do not need an agent that can read everything. They need an agent that can find the specific clause, the specific precedent, the specific exception that bears on the matter at hand. The retrieval has to separate authoritative sources from background material, and it has to surface provenance so the lawyer can verify what the system found before relying on it.

Chau Tran's Glean work generalizes this across enterprises. The enterprise-aware agent, in his framing, is not one that has access to the company's documents. It is one that knows which documents matter for the current user, the current role, the current task — and which to ignore. The boundary work is the engineering work.

The unifying claim across these talks is that enterprise usefulness scales with working-set quality, not with corpus size. A larger corpus without better assembly produces worse outcomes, not better ones. A smaller corpus that is well-ranked, well-scoped, and provenance-tagged often outperforms a larger one that is dumped in raw.
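A working set that is scoped, ranked, and provenance-tagged can be sketched in a few lines. Everything here is an assumption for illustration: the `Passage` fields, the relevance scores (taken as given from some upstream ranker), and the greedy fit-to-budget policy. None of it is drawn from the Harvey, Glean, or Vectara systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    text: str
    source: str       # provenance: where the passage came from
    relevance: float  # score from an upstream ranker (assumed)
    tokens: int


def assemble_working_set(passages, allowed_sources, token_budget):
    """Scope by source, rank by relevance, and stay within the token budget."""
    scoped = [p for p in passages if p.source in allowed_sources]
    ranked = sorted(scoped, key=lambda p: p.relevance, reverse=True)
    selected, used = [], 0
    for passage in ranked:
        if used + passage.tokens > token_budget:
            continue  # does not fit; a smaller passage later still might
        selected.append(passage)
        used += passage.tokens
    return selected


corpus = [
    Passage("Clause 4.2 caps liability.", "contracts/msa.pdf", 0.92, 40),
    Passage("Blog post on liability.", "web/blog", 0.88, 60),
    Passage("Amendment raises the cap.", "contracts/amendment.pdf", 0.81, 50),
    Passage("Full contract boilerplate.", "contracts/msa.pdf", 0.30, 500),
]
working_set = assemble_working_set(
    corpus, {"contracts/msa.pdf", "contracts/amendment.pdf"}, token_budget=100
)
```

The high-scoring blog post never reaches the model because it is out of scope, and the boilerplate is dropped because it would blow the budget; each surviving passage carries its source so a reviewer can verify it.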

This is also where Chapter 4's argument should still be echoing. Evals can measure whether an answer is right. Context architecture determines whether the right answer was even available to the model in the first place. Without the substrate, the eval is measuring guesses.

The next failure frontier is context misassembly

Figure 05.5/“RAG” is four different jobs

For most of the public conversation about AI quality, hallucination has been the boogeyman. The model invented a citation. The model fabricated a fact. The model imagined a precedent that does not exist. Hallucination is real and worth measuring, but it is becoming the wrong primary failure mode to worry about.

The newer, more expensive failure mode is context misassembly.

Context misassembly is what happens when the system retrieves real documents, in the wrong combination, with the wrong weighting, at the wrong moment, and produces an answer that is technically grounded but practically misleading. Nothing is hallucinated. Every cited source exists. Every quote can be verified. But the assembled context misrepresents the underlying state of the world because the assembly missed something, ranked something poorly, or surfaced an outdated version of a document the system also has the current version of.

Morris's distinction between stuffing and memory points at one form of this. The system surfaces three documents about the customer, two of them stale, one of them current. The model averages them and produces an answer that is half-current. Nothing is invented; nothing is correct either.
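One guard against the stale-and-current mix is to collapse retrieved documents to their newest version before assembly. The sketch below assumes each retrieved item carries a stable `doc_id` and an ISO-format `updated_at` string; both field names and the sample records are hypothetical.

```python
def keep_latest_versions(documents):
    """Keep only the newest version of each document before assembly."""
    latest = {}
    for doc in documents:
        current = latest.get(doc["doc_id"])
        # ISO-8601 date strings compare correctly as plain strings.
        if current is None or doc["updated_at"] > current["updated_at"]:
            latest[doc["doc_id"]] = doc
    return sorted(latest.values(), key=lambda d: d["doc_id"])


retrieved = [
    {"doc_id": "acct-17/plan", "updated_at": "2024-01-05", "text": "Plan: Starter"},
    {"doc_id": "acct-17/plan", "updated_at": "2024-06-20", "text": "Plan: Enterprise"},
    {"doc_id": "acct-17/contact", "updated_at": "2024-03-11", "text": "Contact: Dana"},
]
fresh = keep_latest_versions(retrieved)
```

This only handles the case where versions share an identifier; near-duplicates under different ids need a separate deduplication pass.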

Ivan Leo's Manus AI research-agent work — now under Meta Superintelligence — surfaces the same problem at a different scale. A deep-research agent that pulls hundreds of sources can produce a summary that is internally consistent and externally wrong because the assembly drifted as the agent worked. Each individual retrieval was fine. The composition was off.

Karam's layered-RAG approach is partly a response to this. By treating retrieval as a multi-pass operation with distinct failure modes per layer, the system gives misassembly multiple chances to be caught by a downstream layer. That works, partially, but it does not change the underlying architectural fact: context misassembly is a structural failure mode, and the systems that are starting to dominate production are the ones that designed for it explicitly.

The reason this matters for the book's overall argument is that misassembly does not get fixed by a better model. A better model produces a more confident wrong answer faster. The fix lives in the substrate: in how the system assembles, ranks, deduplicates, and freshens the context before the model is asked to reason over it. That is what makes context an infrastructure problem instead of a prompt problem.

MCP makes context a capability problem too

The rise of the Model Context Protocol (MCP) has expanded what context includes. An MCP-connected agent has access not only to documents but to tools: APIs, search endpoints, file operations, internal services, and increasingly, other agents. Each of those tools shows up in the context window as a capability description: a name, a schema, an example. The window now contains both the information the system might consult and the actions the system might take.

That expansion brings a new failure mode.

Matt Carey's mega-context-problem talk identifies the failure cleanly: "We shouldn't be dumping loads of tools into context." When an agent has access to fifty tools, the capability descriptions for those tools can flood the window before the user's question is even processed. The model spends its attention budget reading tool definitions and has less budget for the actual reasoning the user wanted. Worse, the abundance of tools tempts the model into calling the wrong one because it now has to disambiguate between options that all look superficially relevant.

Sam Morrow's GitHub MCP-scaling work tells the production-grade version of this story. GitHub reduced the initial tool-load context (Morrow cites a "49% reduction of the initial load") and then kept reducing both input and output token usage through grouping, tailoring, and intent-aware tool design. The number of tools the agent could in principle call did not go down. The number of tools the agent had to read at any given moment did. That difference is the engineering.

Karan Sampath at Anthropic, working on enterprise rollouts, points at the same dynamic from the governance side. The enterprise version of this problem is not just performance. It is trust. A capability surface that the security team cannot inspect and reason about is one the security team will not approve. The capability problem is also a context-design problem because the window is the place where capability is exposed.

The unifying claim here is that the moment tools entered the context window, context engineering became broader than retrieval; it also became capability management. The directive the production cases point to: expose tools by intent rather than all at once, describe them tightly, and retract them when they stop being relevant, so that you shrink the number of tools the agent must read at any moment, not the number it can call.
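Exposing tools by intent can be sketched with a grouping table and a router. This is a toy illustration, not GitHub's MCP server: the group names, tool names, and keyword-based intent matching are all assumptions, and a production system would use something sturdier than substring checks.

```python
TOOL_GROUPS = {  # hypothetical groups and tool names
    "issues": ["list_issues", "create_issue", "comment_on_issue"],
    "pull_requests": ["list_pull_requests", "review_pull_request"],
    "repos": ["search_code", "read_file"],
}

INTENT_KEYWORDS = {  # naive keyword routing, an assumption for the sketch
    "issues": ("issue", "bug", "ticket"),
    "pull_requests": ("pull request", "review", "merge"),
    "repos": ("code", "file", "search"),
}


def tools_for(request):
    """Return only the tool names whose group matches the request's intent."""
    text = request.lower()
    exposed = []
    for group, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            exposed.extend(TOOL_GROUPS[group])
    return exposed


callable_count = sum(len(tools) for tools in TOOL_GROUPS.values())
visible = tools_for("Triage the new bug reports")
```

All seven tools remain callable in principle, but this request puts only the three issue tools in front of the model.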

Progressive discovery is infrastructure, not UX

The natural reaction to the tool-flood problem is to surface fewer things up front and let the agent discover more as it needs them. That instinct is right. The harder claim is that progressive discovery deserves to be treated as infrastructure rather than a UX nicety.

Bercovici's context-platform framing makes this argument structurally. The right amount of context to expose at any given step is a function of the step, the agent, the user, and the task — not a static property of the system. That is exactly the shape of an infrastructure problem. It needs a layer that owns the decision, observes the outcome, and updates over time.

Sam Morrow's GitHub work is the closest thing the corpus has to a production case study for this. The team did not just trim the tool list. They built grouping, tailoring, and intent-aware exposure into the server itself. The agent does not see every tool; it sees the tools the server has decided are relevant given the agent's intent. The decision lives in the server, not the prompt.

The deeper point is that progressive discovery does work the model cannot do for itself. The model can reason about the tools it is shown. It cannot reason about tools that are absent from the window. Whatever decides which tools to show is doing structural work, and that work is part of the system's architecture whether or not anyone calls it that.

This is also where the book's earlier claims about harness engineering (Chapter 3) start cross-loading. A harness without context-platform thinking is a harness without a context strategy. Once you accept that, "agent-ready codebase" stops meaning "a repository with good tests" and starts meaning "a repository whose tools, documents, and state are exposed at the right moments, in the right shapes, to the right consumers." That is a discipline. It is not a feature flag.

Why context is infrastructure

Context is the substrate, not the garnish.

If context is a substrate, then context engineering is the discipline of building, maintaining, and observing that substrate. That discipline includes retrieval architecture, memory architecture, capability management, progressive disclosure, freshness handling, provenance, and the cost and latency budgets that govern when each of those mechanisms fires. It is not a thing one component does. It is a layer of the system that has its own state and its own failure modes.

A team that treats context as a one-off prompt-assembly problem will keep finding the same failures and keep fixing them with one-off measures. Add a reranker here. Patch a deduplication bug there. Re-tune a chunk size when retrieval starts missing the relevant clause. None of those are wrong, but none of them add up to an architecture.

A team that treats context as infrastructure builds for the failure modes the chapter has named: misassembly, capability flood, memory drift, provenance loss. They version the context layer. They observe what it surfaces. They measure how the model's outputs change when the context layer changes. They treat the context platform the way a database team treats the storage engine — as a piece of working infrastructure whose properties shape everything built on top of it.
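Versioning and observing the context layer can start very small. The sketch below fingerprints the context configuration and logs which documents each version surfaced for a request, so output changes can be traced to context changes. The configuration keys, the log shape, and the function names are hypothetical.

```python
import hashlib
import json


def context_version(config):
    """Stable fingerprint of the context layer's configuration."""
    # Sorting keys makes the fingerprint independent of dictionary order.
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def log_assembly(log, config, request_id, surfaced_ids):
    """Record which documents a given context version surfaced."""
    log.append({
        "context_version": context_version(config),
        "request_id": request_id,
        "surfaced": sorted(surfaced_ids),
    })


config_a = {"chunk_size": 512, "reranker": "on", "top_k": 8}
config_b = {**config_a, "top_k": 4}
assembly_log = []
log_assembly(assembly_log, config_a, "req-1", ["doc_2", "doc_1"])
log_assembly(assembly_log, config_b, "req-1", ["doc_1"])
```

Replaying the same request under two versions and diffing the `surfaced` lists shows what a configuration change actually did to the working set.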

That is the move this chapter has been arguing for. It is also the move that prepares the reader for the next chapter. Once you take context seriously as infrastructure, you immediately notice that the substrate cannot live entirely in a single agent run. It has to persist across sessions, recover from interruptions, and survive partial failure. That is the runtime problem, and it is where the book goes next.

What to do with this

  • Treat context as a versioned, observable layer with its own budgets and failure modes, not a string you assemble per request.
  • Don't use RAG as your memory layer: retrieve to fetch the right documents for a step, and design a separate memory architecture for durable user and entity state across sessions.
  • Match the retrieval pattern to the question — vector search for semantic similarity, graph traversal for relationship questions, hybrid when the query needs both — and make ranking, data shape, and freshness explicit before you build.
  • Optimize for working-set quality, not corpus size: a smaller, well-ranked, provenance-tagged set beats a larger raw dump.
  • Design for context misassembly, not just hallucination — dedupe, freshen, and weight retrieved documents so stale-and-current sources can't average into a half-right answer. A stronger model will not fix this.
  • Don't flood the window with tools: expose them by intent and retract them when irrelevant, shrinking the number of tools the agent must read at any moment rather than the number it can call.
EVIDENCE OF SOURCE · CHAPTER 05 · VIDEOS

9 claims · 39 source anchors

Evidence — Source Anchors

Realistic evals must be grounded in natural tasks and operational history

  • task should be natural and sourced from the real world and then you should be able to reliably grade them.
    #72 — Naman Jain, Cursor · confidence: high
  • If you build your application in a type safe way, if you use frameworks that allow it to be type safe, you can refactor it with confidence much more quickly.
    #184 — Samuel Colvin, Pydantic · confidence: high
  • Dynamic data sets have real world alignment.
    #153 — Quotient AI + Tavily · confidence: high

Context failure is often a system-assembly problem, not simply a small-context-window problem

  • the reason context platform engineering is so important is it dramatically simplifies reaching maximum KV cache hit rates
    #104 — Val Bercovici, WEKA · confidence: high
  • connect the dots with graph technology and solve problems like context engineering
    #105 — Stephen Chin, Neo4j · confidence: high
  • irrelevant facts pollute memory.
    #218 — Daniel Chalef, Zep · confidence: high
  • LLMs and tools are orchestrated through predefined code paths.
    #193 — Chau Tran, Glean · confidence: high
  • Agents look at the starting point, end point and try to provide you the results.
    #752 — Nupur Sharma, Qodo · confidence: high
  • the more the tools, the more issues you have.
    #752 — Nupur Sharma, Qodo · confidence: high

The context gap increasingly includes capability packaging and progressive disclosure

  • doesn't have to be loaded immediately to context.
    #683 — Pedro Rodrigues, Supabase · confidence: high
  • specifically with progressive disclosure.
    #654 — Nick Nisi & Zack Proser, WorkOS · confidence: high
  • 49% reduction of the initial load.
    #625 — Sam Morrow, GitHub · confidence: high
  • rich interactive components that render directly in the chat.
    #747 — Marlene Mhangami & Liam Hampton, GitHub · confidence: high

Harness quality now includes capability packaging, not only repo hygiene

  • That's what a skill is. You're teaching the the LLM how to do something in the way that you expect it to be done
    #654 — Nick Nisi & Zack Proser, WorkOS · confidence: high
  • This is how the agent is
    #683 — Pedro Rodrigues, Supabase · confidence: high
  • 49% reduction of the initial
    #625 — Sam Morrow, GitHub · confidence: high
  • the schema is the UI for the agent.
    #744 — Michael Hablich, Google (Chrome DevTools) · confidence: high

Context failure is often a capability-exposure problem, not only a retrieval problem

  • MCP versus skill debate
    #683 — Pedro Rodrigues, Supabase · confidence: high
  • you can do it in a better way. And that is specifically with progressive disclosure.
    #654 — Nick Nisi & Zack Proser, WorkOS · confidence: high
  • grouping concept of related product
    #625 — Sam Morrow, GitHub · confidence: high

Context engineering is a primary engineering discipline, not a prompt trick

  • picking up the right documents and answering those questions is a really cool use case.
    #100 — Ofer Mendelevitch, Vectara · confidence: high
  • cool load generator that Kalen wrote that lets you configure agent swarms uh and agent subtasks with very specific SLOs's
    #104 — Val Bercovici, WEKA · confidence: high
  • connect the dots with graph technology and solve problems like context engineering
    #105 — Stephen Chin, Neo4j · confidence: high
  • the right agent in the future is going to be this system that decides what type of search
    #157 — Will Bryk, Exa.ai · confidence: high

RAG, memory, and GraphRAG solve different jobs; collapsing them into one bucket misses the architecture

  • rag or retrieval augmented generation where you have so many things that you can't fit them all in
    #48 — Jack Morris · confidence: high
  • why you need to model your memory after your business domain.
    #218 — Daniel Chalef, Zep · confidence: high
  • the basic construct of a knowledge graph is um nodes which represent different people in the situation, relationships, and then you can attach properties to these nodes.
    #105 — Stephen Chin, Neo4j · confidence: high
  • we want to look at patterns for successful graph applications uh for um making LLMs a little bit smarter by putting knowledge graph into the picture.
    #215 — Michael, Jesus & Stephen, Neo4j · confidence: high
  • how can we create a graph rack system what are the advantages of it and if we add the hybrid nature to it how it is helpful
    #219 — Mitesh Patel, NVIDIA · confidence: high
  • you need to be like tuned to what what every technique gives you before you go and invest in it.
    #156 — David Karam, Pi Labs · confidence: high
  • retrieval is not just vector search.
    #756 — Kuba Rogut, Turbopuffer · confidence: high

Enterprise usefulness scales with working-set quality, not corpus size

  • about 73% of LM customers implementing use cases say that factual accuracy is their top challenge right now.
    #100 — Ofer Mendelevitch, Vectara · confidence: high
  • how Harvey tackles retrieval, the types of problems there are and then the challenges that come up with that all with like retrieval quality, scaling, uh security,
    #154 — Calvin Qi (Harvey) & Chang She (Lance) · confidence: high
  • how to build enterprise aware agents. How to bring the brilliance of AI into the messy complex realities
    #193 — Chau Tran, Glean · confidence: high
  • you don't need a trillion at once, you need the right million.
    #756 — Kuba Rogut, Turbopuffer · confidence: high

The next failure frontier is context misassembly, not just hallucination

  • there's this third thing, which I think is like really new and no one is doing it yet, which is training things into weights.
    #48 — Jack Morris · confidence: high
  • this is really useful if you're building anything related to some sort of internal deep research sort of API
    #47 — Ivan Leo, Manus AI / Meta Superintelligence · confidence: high
  • you combine it with all your other signals. So now if you look at your ranking function
    #156 — David Karam, Pi Labs · confidence: high
  • it's hybrid search because you have multiple approaches, and then you can either boost them together. You could do reranking, which is becoming more and more popular.
    #172 — Philipp Krenn, Elastic · confidence: high
AI QUALITY · CHAPTER 05 · MASH JUDGES

scored on version git:2f2668c

EVIDENCE OF SOURCE · CHAPTER 06
FIG. 06.0 · OPENER · Stateless loop vs durable runtime

CH. 06 // Drafting · 2,194 words · 10 min read

Chapter 6 — Runtimes, State, and the Human Control Plane

A chatbot can get away with amnesia. A production agent cannot.

Short-lived assistance can live inside a conversational loop — a turn comes in, a turn goes out, and nothing has to survive past the reply. Delegated work cannot live there. The moment a system has to persist across retries, timeouts, approvals, multiple tools, and possibly many parallel workers, the central problem stops being next-token cleverness and becomes execution semantics. The system has to preserve state, survive interruption, expose its progress, and resume without losing the thread. This is the chapter where the book's argument turns from what the model knows to how the work itself stays alive between the moments the model is thinking.

Samuel Colvin at Pydantic states the lesson in the voice of someone who learned it in production: "building production AI agents reveals a harsh truth — stateless architectures that work for simple demos become impossibly painful at scale." The word painful is doing honest work. The failure is not dramatic. It is the slow accumulation of edge cases — the timeout that loses an hour of work, the retry that double-charges, the approval that arrives after the agent already acted — that turns a magical demo into a system nobody trusts.

Why stateless demos break

Figure 06.1/Transcript vs workflow

The seductive thing about a chat-loop agent is that it works immediately. You wire a model to some tools, let it loop, and on a short, happy-path task it looks like the future. The trouble is that the architecture has no memory of what has actually happened — only a transcript of what was said. And a transcript is not state.

The distinction matters the instant anything goes wrong. A real delegated task runs long enough to hit a timeout, a rate limit, a flaky tool, a network blip, a required human approval. In a stateless loop, every one of those is a small catastrophe, because there is no durable record of where the work had gotten to. The system cannot answer the only questions that matter under failure: what has been done, what is in flight, and what is safe to retry. It can replay the conversation, but it cannot resume the work, because it never modeled the work as something separate from the conversation about it.

This is why so many impressive agent demos collapse when a team tries to operationalize them. The model is capable, the prompts are decent, the context is strong — and the system still falls over, because it was built like a chat session when the job required a workflow. So run the test before you scale a demo: ask whether the system can answer what has been done, what is in flight, and what is safe to retry without replaying the conversation. If the only record is a transcript, you have a chat session, not a workflow — and the fix is not a smarter model but a different architecture underneath.
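The failure-state test can be made concrete with a small sketch. Everything in it is illustrative: the `Step` and `WorkflowState` names, the three statuses, and the invoice example are assumptions, not taken from any system discussed in this chapter. The point is only that the three questions become queries over recorded state rather than inferences from a transcript.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepStatus(Enum):
    PENDING = "pending"
    IN_FLIGHT = "in_flight"
    DONE = "done"


@dataclass
class Step:
    name: str
    idempotent: bool  # True if running the step twice causes no extra side effect
    status: StepStatus = StepStatus.PENDING


@dataclass
class WorkflowState:
    """Hypothetical record of the work itself, kept separate from the chat transcript."""
    steps: list = field(default_factory=list)

    def done(self) -> list:
        return [s.name for s in self.steps if s.status is StepStatus.DONE]

    def in_flight(self) -> list:
        return [s.name for s in self.steps if s.status is StepStatus.IN_FLIGHT]

    def safe_to_retry(self) -> list:
        # An interrupted step is only safe to re-run blindly if it is idempotent.
        return [
            s.name
            for s in self.steps
            if s.status is StepStatus.IN_FLIGHT and s.idempotent
        ]


# Illustrative snapshot taken at the moment of a crash.
state = WorkflowState([
    Step("fetch_invoice", idempotent=True, status=StepStatus.DONE),
    Step("charge_card", idempotent=False, status=StepStatus.IN_FLIGHT),
    Step("send_receipt", idempotent=True),
])

print("done:", state.done())
print("in flight:", state.in_flight())
print("safe to retry:", state.safe_to_retry())
```

In this snapshot `safe_to_retry()` comes back empty because the only interrupted step, `charge_card`, is marked non-idempotent. A transcript alone could not tell you that.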

Durable execution is the runtime requirement

Figure 06.2/The human control plane

The architecture that survives reality is durable execution: treat the long-running task as structured, checkpointed execution rather than a growing transcript. Preeti Somal at Temporal puts the stakes plainly: agentic systems "must scale and provide durability and reliability — otherwise, no one's going to trust your agent." Trust, in her framing, is not a property of the model's answers. It is a property of the runtime's behavior under failure.

What durability buys is the ability to distinguish what was said from what has actually happened. A durable runtime records each completed step, so that when a tool call times out on the ninth step of a twelve-step task, the system resumes from the ninth step rather than restarting from zero or, worse, redoing a side effect that already fired. Somal describes pushing the reliability semantics down into the runtime so they leave the prompt entirely: "nowhere in there will there be statements like, if something fails keep retrying it; all of those pieces are handled" by the execution layer. That is the architectural move. Retries, timeouts, and resumption stop being things the agent has to reason about turn by turn and become guarantees the runtime provides. The prompt gets to be about the task; the runtime takes care of survival.
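A minimal sketch of the checkpoint-and-resume idea, assuming an in-memory dictionary stands in for the durable store (a real runtime would persist it outside the process and schedule the retry itself). The names `run_durably`, `StepTimeout`, and `make_step` are hypothetical and do not correspond to Temporal's or any other product's API.

```python
class StepTimeout(Exception):
    """Raised by a step to simulate a tool call timing out."""


def run_durably(steps, checkpoint):
    """Run steps in order, skipping any step already recorded in the checkpoint."""
    for name, fn in steps:
        if name in checkpoint:
            continue  # completed in an earlier attempt; do not re-fire it
        checkpoint[name] = fn()  # record the result only after the step succeeds


calls = []                          # every step body that actually executed
flaky = {"remaining_failures": 1}   # step 9 fails once, then succeeds


def make_step(n):
    def step():
        if n == 9 and flaky["remaining_failures"] > 0:
            flaky["remaining_failures"] -= 1
            raise StepTimeout(f"step {n} timed out")
        calls.append(f"step-{n}")
        return f"result-{n}"
    return step


steps = [(f"step-{n}", make_step(n)) for n in range(1, 13)]
checkpoint = {}  # assumption: stands in for a durable store

try:
    run_durably(steps, checkpoint)
except StepTimeout:
    pass  # a real runtime would own this retry, not the caller

completed_before_retry = len(checkpoint)
run_durably(steps, checkpoint)  # resumes at step 9; steps 1 to 8 are skipped

print(completed_before_retry, len(checkpoint), len(calls))
```

Note that nothing in the step bodies mentions retrying; that concern lives entirely in the runner. The sketch leaves one gap open: if a side effect fires and the process dies before the checkpoint is written, the step would run again on resume, so non-idempotent steps need extra protection such as an idempotency key.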

The same talk names the second thing durability gives you, and it points straight at the next section: "we also store all of the workflow history, so that you can look at the visibility of what is happening as your agent is navigating this complex set of interactions." History is not just for recovery. It is for inspection — and inspection is where humans re-enter the system.

The human control plane is an architectural layer

Figure 06.3/Subagents need recomposition

It is tempting to treat human oversight as a temporary crutch — something you keep around until the models get good enough to remove it. The corpus argues the opposite. In high-value systems, human control is not a phase on the way to full autonomy. It is a permanent architectural layer, designed in rather than bolted on.

Joel Hron at Thomson Reuters gives the clearest frame for what that layer regulates. Autonomy, he argues, is not a switch you flip but a dial you tune: agency is "not a binary thing but a lever that you can dial" up or down depending on how irreversible, risky, and observable the work is. A low-stakes draft can run with the dial turned far up. A filing with professional consequences runs with it turned down, with explicit approval points where a human re-enters before the system acts. The design question is never "autonomous or not." It is "how much agency, at which steps, with what review."
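One way to picture the dial is as a per-step policy check. The sketch below is an assumption-heavy illustration, not Hron's method: the `StepProfile` fields, the 0 to 3 risk scale, and the exposure formula are all invented here to show how irreversibility, risk, and observability could feed a single approval decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepProfile:
    irreversible: bool  # can the action be undone once it fires?
    risk: int           # 0 (trivial) to 3 (professional or financial consequences)
    observable: bool    # will a human see the effect before it matters?


def requires_approval(step: StepProfile, dial: int) -> bool:
    """Return True when a human must approve the step before it runs.

    `dial` is the highest exposure score allowed to run unattended;
    raising it grants the system more agency.
    """
    if step.irreversible and step.risk >= 2:
        return True  # consequential and cannot be undone: always gated
    exposure = step.risk + (0 if step.observable else 1)
    return exposure > dial


draft = StepProfile(irreversible=False, risk=0, observable=True)
lookup = StepProfile(irreversible=False, risk=2, observable=False)
filing = StepProfile(irreversible=True, risk=3, observable=True)

for name, step in [("draft", draft), ("lookup", lookup), ("filing", filing)]:
    print(name, requires_approval(step, dial=2))
```

The design choice worth noticing is that the dial moves the threshold for ordinary steps but never overrides the hard gate on consequential, irreversible actions.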

The mechanism that makes those approval points workable is exactly the workflow history durability provides. Hron describes deep-research systems whose long-running behavior becomes "the trajectories that the model would be following along its path of answering this particular type of legal question" — inspectable paths a human can audit rather than opaque jumps from question to answer. The control plane is built from these surfaces: the approval gates where a human authorizes the next step, the roll-up views that show what the system is doing without drowning the reviewer in raw agent chatter, the trajectory and history records that make an action reconstructable after the fact. Eric Zakariasson at Cursor describes the roll-up form of this directly — the human needs "an overview of the processes," a single surface answering what every worker is doing and what actually needs a person's attention, rather than a firehose of individual logs.

This is the same human-judgment-at-the-edges principle the book keeps returning to, now given a concrete home in the runtime. The point of the control plane is not to slow the system down. It is to focus scarce human attention on the consequential moments and let the runtime carry everything else — which gives you a design test for any review surface: if it surfaces raw agent chatter instead of a roll-up of what needs a person, or if it gates a reversible low-stakes step the way it gates an irreversible one, it is spending human attention in the wrong place.
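A roll-up surface can be sketched as a reduction over worker statuses. The `WorkerStatus` record, the four state names, and the `roll_up` function are hypothetical; the sketch only shows the shape of the idea, which is counts for everything and detail only for what needs a person.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class WorkerStatus:
    worker: str
    state: str      # "running", "done", "failed", or "awaiting_approval"
    note: str = ""  # short reason shown to the reviewer


def roll_up(statuses):
    """Summarize all workers; list individually only those needing a human."""
    counts = Counter(s.state for s in statuses)
    needs_person = [
        (s.worker, s.note)
        for s in statuses
        if s.state in {"failed", "awaiting_approval"}
    ]
    return {"counts": dict(counts), "needs_person": needs_person}


overview = roll_up([
    WorkerStatus("w1", "running"),
    WorkerStatus("w2", "done"),
    WorkerStatus("w3", "awaiting_approval", "submit filing"),
    WorkerStatus("w4", "failed", "tool timeout after retries"),
    WorkerStatus("w5", "running"),
])

print(overview["counts"])
print(overview["needs_person"])
```

Five workers collapse into one line of counts plus two items for review, which is the opposite of a firehose of individual logs.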

Parallelism raises the stakes on coordination

Figure 06.4/Agency is a dial, not a switch

Everything so far concerns a single durable worker. The architecture gets harder, and more interesting, the moment teams reach for many workers at once — and this is where the corpus is most unsettled.

The appeal is obvious: if one agent is leverage, a fleet is more leverage. OpenAI's Codex team describes the mechanism, spinning off "a master task into decomposable parallel and independent tasks." But the precondition is in that phrase: parallelize only work that is genuinely decomposable and independent, and only after you have a way to recompose, inspect, and route the output back to a human at the right moment. Add workers before you have that, and parallelism just manufactures chaos faster: more diffs, more conflicts, and more review than anyone can absorb, which is exactly the alignment-debt failure Chapter 9 will name at organizational scale. Teams go wrong when they treat "spin up more agents" as the win while the recomposition layer is the actual bottleneck.

Lou Bichard at Ona sharpens the diagnosis to a single missing piece. The runtime, he argues, is solved — "there are many options for this now, sandboxes and containers" — and so are triggers and orchestration. "The thing that's missing," he says, "is coordination": the agent-native primitive that lets parallel workers pick up tasks, signal completion, and hand off without a human stitching them together. And he is pointed about what is not that primitive: "GitHub is not a coordination layer for agents — it gets incredibly overwhelming." His candidate building blocks are exactly this chapter's subject — "state machines, by building out workflows and effectively state machines" — which lands the parallelism question squarely back on durable execution.
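The state-machine building block can be sketched as a task board with explicit transitions for picking up, completing, and handing off work. This is an in-memory illustration under stated assumptions: `TaskBoard`, `TaskState`, and the transition table are hypothetical names, not Ona's design, and a real version would need an atomic claim against a durable store so that two workers in different processes cannot both win.

```python
from enum import Enum


class TaskState(Enum):
    OPEN = "open"
    CLAIMED = "claimed"
    DONE = "done"


# Legal transitions; anything else is rejected.
ALLOWED = {
    TaskState.OPEN: {TaskState.CLAIMED},
    TaskState.CLAIMED: {TaskState.DONE, TaskState.OPEN},  # OPEN means handed back
    TaskState.DONE: set(),
}


class TaskBoard:
    def __init__(self, task_ids):
        self.state = {t: TaskState.OPEN for t in task_ids}
        self.owner = {}

    def _move(self, task_id, new_state):
        if new_state not in ALLOWED[self.state[task_id]]:
            raise ValueError(f"{task_id}: {self.state[task_id]} -> {new_state} not allowed")
        self.state[task_id] = new_state

    def claim(self, task_id, worker) -> bool:
        """Pick up a task; returns False if another worker already holds it."""
        if self.state[task_id] is not TaskState.OPEN:
            return False
        self._move(task_id, TaskState.CLAIMED)
        self.owner[task_id] = worker
        return True

    def complete(self, task_id, worker):
        """Signal completion; only the current owner may do so."""
        if self.owner.get(task_id) != worker:
            raise PermissionError(f"{worker} does not own {task_id}")
        self._move(task_id, TaskState.DONE)

    def release(self, task_id, worker):
        """Hand a task back so another worker can claim it."""
        if self.owner.get(task_id) != worker:
            raise PermissionError(f"{worker} does not own {task_id}")
        self._move(task_id, TaskState.OPEN)
        del self.owner[task_id]


board = TaskBoard(["t1", "t2"])
first = board.claim("t1", "worker-a")
second = board.claim("t1", "worker-b")  # rejected: one active writer per task
board.complete("t1", "worker-a")
print(first, second, board.state["t1"].value)
```

Enforcing a single owner per task is also how the one-active-writer discipline shows up in miniature: conflicting edits are ruled out by construction rather than resolved afterwards.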

What makes the corpus credible here is that the teams shipping production multi-agent systems have not agreed on an answer; they have each substituted a known mechanism for the missing one. Bichard reaches for state machines on top of durable execution. Factory runs features serially with one active writer ("serial execution with targeted internal parallelization"), eliminating the coordination problem by construction, and reports a longest mission of sixteen days. Anthropic's long-running agents take a third path: a planner-generator-evaluator loop where each role gets "its own kind of context window" and the agents "negotiate what done actually means" through a contract written to files on disk before any code is produced. Their team traces the resulting capability curve from roughly a one-hour autonomous run to twelve hours on the same simple harness. Serial execution, file-based contracts, state machines plus durable execution: three substitutes for one primitive that does not yet exist. The honest chapter names the gap and shows the three things teams actually ship, rather than pretending there is a consensus building block.

The runtime is where intelligence becomes dependable

Pull the chapter together and its argument is structural. A machine colleague is not a model with tools attached. It is a model inside an operating environment, and that environment is what determines whether bursts of intelligence become dependable delegated work.

The environment has named parts now. Durable execution, so the work survives interruption and resumes instead of restarting. Explicit state and workflow history, so the system can answer what has actually happened. A human control plane of approvals, roll-up views, and inspectable trajectories, so people supervise at the consequential edges. And, once there are many workers, a coordination story, even if today that story is a chosen substitute rather than a solved primitive. None of these live in the model. All of them live in the runtime, and all of them are the difference between a demo and a system.

The moment a durable, long-running agent can act on its own — with state, tools, and the authority to use them over time — bounding that authority becomes the price of letting it act at all. Identity, permissions, sandboxes, audit: that is the next chapter's subject. Durability gave the agent staying power. Security decides what it is allowed to do with it.

What to do with this

  • Before you scale any agent demo, run the failure-state test: can the system answer what has been done, what is in flight, and what is safe to retry — without replaying the conversation? If the only record is a transcript, you have a chat session, not a workflow. The fix is a different architecture, not a smarter model.
  • Push retry, timeout, and resumption semantics down into the runtime so they leave the prompt entirely — aim for a state where "if something fails keep retrying it" appears nowhere in your prompt because the execution layer handles it. Checkpoint each completed step so a tool timeout on step nine resumes from step nine, not from zero and not by re-firing a side effect that already ran.
  • Store full workflow history, and use it for two distinct jobs: recovery (resuming after failure) and inspection (a human auditing what the agent actually did). Treat the trajectory record as the surface humans review, not an afterthought.
  • Stop asking "autonomous or not" and instead set the agency dial per step — tuned by how irreversible, risky, and observable that step is. Turn it up for a low-stakes draft; turn it down with an explicit approval gate before an action with real consequences.
  • Audit your review surfaces: if one shows raw agent chatter instead of a roll-up of what needs a person, or gates a reversible step as strictly as an irreversible one, it is spending scarce human attention in the wrong place. Build the roll-up overview that answers what every worker is doing and what actually needs review.
  • Don't reach for more parallel workers until you have a recomposition layer — coordination is the missing primitive, and GitHub is not it. Until an agent-native coordinator exists, pick a deliberate substitute: serial execution with one active writer, file-based "negotiate what done means" contracts, or state machines plus durable execution.
EVIDENCE OF SOURCE · CHAPTER 06 · VIDEOS

21 claims · 88 source anchors

Evidence — Source Anchors

The important transition is from suggestion to delegated execution

  • from helpfulness to productive
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • I think they need more
    #3 — Jacob Lauritzen, Legora · confidence: high
  • most primitives the magic happens when you combine these things together
    #138 — Sam Bhagwat, Mastra.ai · confidence: high

Chat is an insufficient control surface for long-running or high-stakes work

  • Chat is one-dimensional. It's a very low bandwidth interface,
    #3 — Jacob Lauritzen, Legora · confidence: high
  • we're asking AI systems to now produce output and produce judgments and decisions
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • handle state potentially over long periods of time. There needs to be human interaction for approvals
    #167 — Preeti Somal, Temporal · confidence: high

Reliability comes less from model cleverness than from surrounding scaffolding

  • The important thing is not the code but the prompt and the guardrails that got you there.
    #16 — Ryan Lopopolo, OpenAI · confidence: high
  • Agents have intelligence and capabilities, but not always expertise that we need for real work.
    #83 — Barry Zhang & Mahesh Murag, Anthropic · confidence: high
  • these are three kind of like ingredients which are pretty simple and pretty basic, but I think provide an interesting kind of like first principles approach for how to think about
    #198 — Harrison Chase, LangChain/LangGraph · confidence: high

Harness quality is a major determinant of coding-agent quality

  • a good harness is really operationalized around giving the model text at the right time
    #16 — Ryan Lopopolo, OpenAI · confidence: high
  • there's so much work that has been put in uh over the last you know 20 to 30 years around the automated validation and verification of software that you build
    #57 — Eno Reyes, Factory AI · confidence: high
  • instead of micromanaging, what I'm doing is I'm scaffolding and providing context.
    #190 — Eric Hou, Augment Code · confidence: high
  • identifying problems with the code because if there's no problems then it's probably high quality code
    #179 — Josh Albrecht, Imbue · confidence: high

Specs are not paperwork; they are executable intent

  • specs are natural language, you're using specs as a control surface to explain what you want the system to do.
    #40 — Al Harris, Amazon Kiro · confidence: high
  • leaving breadcrumbs, documentation, ADRs, persona oriented documentation around what a good job looks like.
    #16 — Ryan Lopopolo, OpenAI · confidence: high

Evals are a control system, not just a test suite

  • improvement without measurement is limited and imprecise.
    #125 — Ido Pesok, Vercel v0 · confidence: high
  • We still want to build reliable scalable applications and that is still hard
    #184 — Samuel Colvin, Pydantic · confidence: high
  • eval to us it's actually the same problem from a from a systems perspective.
    #628 — Phil Hetzel, Braintrust · confidence: high
  • small CLI tool that we call eval tool
    #689 — Lawrence Jones, incident.io · confidence: high
  • designed to allow agents to leverage our eval suite files.
    #689 — Lawrence Jones, incident.io · confidence: high
  • classic benchmark maxing.
    #746 — Ara Khan, Cline · confidence: high
  • There are right ways to use them. There are wrong ways to use them.
    #746 — Ara Khan, Cline · confidence: high

Realistic evals must be grounded in natural tasks and operational history

  • task should be natural and sourced from the real world and then you should be able to reliably grade them.
    #72 — Naman Jain, Cursor · confidence: high
  • If you build your application in a type safe way, if you use frameworks that allow it to be type safe, you can refactor it with confidence much more quickly.
    #184 — Samuel Colvin, Pydantic · confidence: high
  • Dynamic data sets have real world alignment.
    #153 — Quotient AI + Tavily · confidence: high

Context failure is often a system-assembly problem, not simply a small-context-window problem

  • the reason context platform engineering is so important is it dramatically simplifies reaching maximum KV cache hit rates
    #104 — Val Bercovici, WEKA · confidence: high
  • connect the dots with graph technology and solve problems like context engineering
    #105 — Stephen Chin, Neo4j · confidence: high
  • irrelevant facts pollute memory.
    #218 — Daniel Chalef, Zep · confidence: high
  • LLMs and tools are orchestrated through predefined code paths.
    #193 — Chau Tran, Glean · confidence: high
  • Agents look at the starting point, end point and try to provide you the results.
    #752 — Nupur Sharma, Qodo · confidence: high
  • the more the tools, the more issues you have.
    #752 — Nupur Sharma, Qodo · confidence: high

Durable state and workflow semantics are trust features, not backend details

  • once we get into longer running workflows, that's where it really becomes a problem.
    #99 — Samuel Colvin, Pydantic · confidence: high
  • no one's going to trust your agent.
    #167 — Preeti Somal, Temporal · confidence: high
  • the workflow orchestration layer needs to be deterministic. So it can be rerun um in a in a uh deterministic fashion
    #44 — Peter Wielander, Vercel · confidence: high
  • where I've got some big production CI stack to go and run and deployment takes hours, being able to go and change variables in production or in staging very quickly
    #657 — Samuel Colvin, Pydantic · confidence: high
  • you'll be able to assemble agent teams that can complete tasks orders of magnitude harder than what you can complete with a single agent today.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • minding the gap around observability.
    #680 — Amy Boyd & Nitya Narasimhan, Microsoft · confidence: high

Human oversight works best as an architectural layer, not an afterthought

  • There needs to be human interaction for approvals or other reasons and of course they need to be able to be uh able to run in parallel for efficiency
    #167 — Preeti Somal, Temporal · confidence: high
  • dial these agency dials far up.
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • maintaining a factory would require you to have an overview of the processes you want your coding agents to go through.
    #629 — Eric Zakariasson, Cursor · confidence: high

High-stakes systems tune agency instead of maximizing it

  • a binary thing but as a lever that you can dial
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • agentic workflows we can plan and execute
    #201 — Yogendra Miraje, Factset · confidence: high
  • send it to me for approval.
    #202 — Rita Kozlov, Cloudflare · confidence: high
  • credentials, payments, and checkout require determinism.
    #745 — Steve Kaliski, Stripe · confidence: high

The harness is evolving from a local loop into a staged software factory

  • getting to a place where you can build your own like software factory
    #629 — Eric Zakariasson, Cursor · confidence: high
  • unified agent harness that will manage
    #632 — Vaibhav Srivastav & Katia Gil Guzman, OpenAI · confidence: high
  • parallel agents working together to fix
    #42 — Robert Brennan, OpenHands · confidence: high
  • The difference with missions is that we run features serially.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • Our longest mission ran for 16 days
    #653 — Luke Alvoeiro, Factory · confidence: high
  • We just kind of gave each role its own kind of context window.
    #691 — Ash Prabaker & Andrew Wilson, Anthropic · confidence: high
  • it's no longer about the model or the agent. It's about the process.
    #743 — Vincent Koc, OpenClaw · confidence: high

The context gap increasingly includes capability packaging and progressive disclosure

  • doesn't have to be loaded immediately to context.
    #683 — Pedro Rodrigues, Supabase · confidence: high
  • specifically with progressive disclosure.
    #654 — Nick Nisi & Zack Proser, WorkOS · confidence: high
  • 49% reduction of the initial load.
    #625 — Sam Morrow, GitHub · confidence: high
  • rich interactive components that render directly in the chat.
    #747 — Marlene Mhangami & Liam Hampton, GitHub · confidence: high

AI-native advantage depends on organizational coherence, not output volume alone

  • you'll be able to assemble agent teams that can complete tasks orders of magnitude harder than what you can complete with a single agent today.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • observing their workflows, their pain points, co-designing solutions with them
    #693 — Eoin Mulgrew, 10 Downing Street · confidence: high
  • maintaining a factory would require you to have an overview of the processes you want your coding agents to go through.
    #629 — Eric Zakariasson, Cursor · confidence: high

Harness quality now includes capability packaging, not only repo hygiene

  • That's what a skill is. You're teaching the the LLM how to do something in the way that you expect it to be done
    #654 — Nick Nisi & Zack Proser, WorkOS · confidence: high
  • This is how the agent is
    #683 — Pedro Rodrigues, Supabase · confidence: high
  • 49% reduction of the initial
    #625 — Sam Morrow, GitHub · confidence: high
  • the schema is the UI for the agent.
    #744 — Michael Hablich, Google (Chrome DevTools) · confidence: high

Context failure is often a capability-exposure problem, not only a retrieval problem

  • MCP versus skill debate
    #683 — Pedro Rodrigues, Supabase · confidence: high
  • you can do it in a better way. And that is specifically with progressive disclosure.
    #654 — Nick Nisi & Zack Proser, WorkOS · confidence: high
  • grouping concept of related product
    #625 — Sam Morrow, GitHub · confidence: high

Evals are strongest when they are trace-linked and fed by production observability

  • what is the gap between agent observability and what you're actually building. How do we mind that gap?
    #680 — Amy Boyd & Nitya Narasimhan, Microsoft · confidence: high
  • we go from like a testing and eval paradigm to a monitoring p uh paradigm.
    #655 — Danny Gollapalli & Ben Hylak, Raindrop · confidence: high
  • where I've got some big production CI stack to go and run and deployment takes hours, being able to go and change variables in production or in staging very quickly
    #657 — Samuel Colvin, Pydantic · confidence: high
  • download all of the UI that we have as a file system?
    #689 — Lawrence Jones, incident.io · confidence: high
  • 25 agents in parallel
    #689 — Lawrence Jones, incident.io · confidence: high
  • it's actually the telemetry that does that.
    #750 — Dat Ngo, Arize · confidence: high

Coordination is the unsolved runtime primitive for multi-agent systems

  • the thing that's missing for me is coordination.
    #704 — Lou Bichard, Ona · confidence: high
  • through sort of state machines, you know, by building out workflows and effectively state machines
    #704 — Lou Bichard, Ona · confidence: high
  • They step on each other's changes. They duplicate work. They make inconsistent architectural decisions.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • we have the two agents basically negotiate what done actually means.
    #691 — Ash Prabaker & Andrew Wilson, Anthropic · confidence: high

Context engineering is a primary engineering discipline, not a prompt trick

  • picking up the right documents and answering those questions is a really cool use case.
    #100 — Ofer Mendelevitch, Vectara · confidence: high
  • cool load generator that Kalen wrote that lets you configure agent swarms uh and agent subtasks with very specific SLOs's
    #104 — Val Bercovici, WEKA · confidence: high
  • connect the dots with graph technology and solve problems like context engineering
    #105 — Stephen Chin, Neo4j · confidence: high
  • the right agent in the future is going to be this system that decides what type of search
    #157 — Will Bryk, Exa.ai · confidence: high

RAG, memory, and GraphRAG solve different jobs; collapsing them into one bucket misses the architecture

  • rag or retrieval augmented generation where you have so many things that you can't fit them all in
    #48 — Jack Morris · confidence: high
  • why you need to model your memory after your business domain.
    #218 — Daniel Chalef, Zep · confidence: high
  • the basic construct of a knowledge graph is um nodes which represent different people in the situation, relationships, and then you can attach properties to these nodes.
    #105 — Stephen Chin, Neo4j · confidence: high
  • we want to look at patterns for successful graph applications uh for um making LLMs a little bit smarter by putting knowledge graph into the picture.
    #215 — Michael, Jesus & Stephen, Neo4j · confidence: high
  • how can we create a graph rack system what are the advantages of it and if we add the hybrid nature to it how it is helpful
    #219 — Mitesh Patel, NVIDIA · confidence: high
  • you need to be like tuned to what what every technique gives you before you go and invest in it.
    #156 — David Karam, Pi Labs · confidence: high
  • retrieval is not just vector search.
    #756 — Kuba Rogut, Turbopuffer · confidence: high

Once an AI system can act autonomously, bounding its authority becomes the price of deployment

  • we're asking AI systems to now produce output and produce judgments and decisions
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • most primitives the magic happens when you combine these things together
    #138 — Sam Bhagwat, Mastra.ai · confidence: high
AI QUALITY · CHAPTER 06 · MASH JUDGES
1 unsupported claim — ship-blocker

scored on version git:2f2668c

EVIDENCE OF SOURCE · CHAPTER 07
FIG. 07.0 · OPENER · Unbounded agent vs bounded autonomy

CH. 07 // Drafting · 3,694 words · 16 min read

Chapter 7 — Security, Identity, and High-Stakes Trust

Once AI systems can act, trust stops being only a question of model quality. It becomes a question of bounded authority.

A helpful model can get away with being vague about power. An acting system cannot.

The moment an AI system can call tools, execute code, traverse accounts, or continue working after the user has stopped watching, security moves to the center of the product. The important question is no longer whether the model sounds smart. It is whether the system has a clear identity, narrow permissions, real execution boundaries, and enough evidence left behind that a human institution can understand what happened later.

That is why agent security is not mainly a prompt-safety problem. It is a delegated-authority problem. Traditional software security often focused on discrete requests. Agentic systems stretch the unit of risk across a workflow: interpretation, retrieval, tool use, retries, follow-up behavior, and approval boundaries. Trust has to cover the whole path.

The strongest support for this in the current corpus is not a single definitive study. It is a pattern of practitioner convergence. Security talks, identity talks, and high-stakes workflow talks keep arriving at the same practical lesson from different directions: once systems can act on behalf of users, the hard part is no longer only generating a plausible next step. It is controlling what powers were delegated, under what conditions, and with what record of use.

Identity is the substrate of trust

Figure 07.1/Authority boundary collapse

The most common engineering shortcut for connecting an agent to a real system is to give the agent a standing credential — a long-lived API key, a service-account token, a personal access token "borrowed" from the human operator. The shortcut works, in the sense that the system can act. It also dissolves the question the chapter is trying to take seriously.

A standing credential is not a delegation. It is the agent inheriting the entire authority of whoever's credential it borrowed, for as long as the credential remains valid, across whatever surface that credential touches. There is no scope to revoke. There is no expiry that maps to the actual task. There is no record that an agent was acting, rather than a human acting on the same key. The unit of trust the system has just granted is much larger than the unit of work it was asked to do.

Patrick Riley and Carlos Galan at Auth0 frame this directly. Their work on identity for AI agents starts from the position that "we authorize agents, servers" — making the agent and its capabilities first-class citizens in the identity provider, not invisible passengers on a human's credential. The shift looks technical but it is structural. Once the agent is a real principal in the identity layer, it can have its own scopes, its own lifetimes, its own audit footprint, and its own revocation path.

Jared Hanson at Keycard pushes the same argument from the developer side. The common habit of treating an agent like an API consumer — with a permanent key paired to a permanent identity — does not survive contact with workflow reality. Agents act on behalf of specific users, for specific tasks, with specific data. Hanson's case is that securing agents using OAuth is not just a protocol choice; it is the right shape for the delegation. The agent gets a token that is bounded to a session, a user, and a set of scopes. When the work is done, the authority ends.

Garrett Galow at WorkOS pushes the framing further. His talk on cross-app access for MCP proposes the identity provider as a trust bridge between clients and servers — so that the credentials flowing through the agent can be obtained without repeated manual consent flows, and so that the IT organization keeps visibility into which third-party systems are reading which data. That is the structural answer to a question every team running agents at scale eventually faces: how do you let an agent move between tools without making each tool feel like an independent integration project?

The shared move across these three talks is to treat agent identity as a first-class engineering object — not a label, not a service-account workaround, but a principal in the identity system with bounded scope, bounded lifetime, and an audit footprint. An agent that is authenticated as a blurry extension of a human is not delegated. It is impersonating.
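
The difference between a standing credential and a delegation can be sketched in a few lines. This is an illustration only, not the API of Auth0, Keycard, or WorkOS: `AgentGrant`, `issue_grant`, and the scope strings are hypothetical names, and a real deployment would issue signed OAuth tokens through an identity provider rather than in-process objects.

```python
from dataclasses import dataclass
from typing import Optional
import time


@dataclass(frozen=True)
class AgentGrant:
    """A delegation: which agent, acting for whom, with what scopes, until when."""
    agent_id: str       # the agent is its own principal, not a borrowed human key
    on_behalf_of: str   # the user who delegated
    scopes: frozenset   # hypothetical scope strings such as "documents:read"
    expires_at: float   # absolute expiry, in seconds since the epoch

    def allows(self, scope: str, now: Optional[float] = None) -> bool:
        """True only if the scope was delegated and the grant has not expired."""
        now = time.time() if now is None else now
        return now < self.expires_at and scope in self.scopes


def issue_grant(agent_id, user, scopes, ttl_seconds, now=None):
    """Issue a grant whose lifetime matches the task rather than the credential."""
    now = time.time() if now is None else now
    return AgentGrant(agent_id, user, frozenset(scopes), now + ttl_seconds)


grant = issue_grant("tax-agent", "alice", {"documents:read"}, ttl_seconds=900, now=1000.0)
print(grant.allows("documents:read", now=1100.0))   # delegated scope, inside the lifetime
print(grant.allows("documents:write", now=1100.0))  # scope that was never delegated
print(grant.allows("documents:read", now=2000.0))   # same scope after the grant expired
```

The point of the shape is that every check names a principal, a scope, and a time, which are the three things a standing key cannot express.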

A concrete failure case

Figure 07.2/Scoped agent identity

Imagine a tax-preparation agent operating inside a professional workflow of the kind described by Joel Hron at Thomson Reuters. The system ingests source documents, extracts fields, maps them into a tax engine, checks validation errors, revisits documents to resolve missing information, and assembles a draft return. Nothing in that flow requires the model to be malicious for trouble to begin. It only has to be slightly wrong in a place where authority and evidence are easy to blur.

Suppose one document is ambiguous, a number is mapped to the wrong field, and the validation engine throws an error. The agent now has several possible powers: search for more supporting material, pull additional client records, reinterpret the rule, override a warning, or proceed with a draft that looks internally coherent. In a weak design, those powers can collapse into one another. The same system that is supposed to summarize evidence also has enough access to fetch more data, make a judgment call, and keep moving. From the outside, the workflow still looks smooth. The danger is not theatrical failure. The danger is an authority boundary disappearing inside a competent-looking trajectory.

In high-stakes work, the risky move is often not one bad answer. It is a system quietly crossing from assistance into authorization.

Least privilege is different for agents than for users

Figure 07.3/Step-up OAuth

Least privilege is among the oldest pieces of security advice, and for agents it is not enough on its own.

In classic SaaS software, a narrow scope usually constrains a bounded API action. In agentic systems, the same scope can be recombined across many steps. A permission that seems harmless in isolation can become more powerful when paired with retrieval, reasoning, persistence, and retries. So "only give the agent the minimum access it needs" is correct but incomplete. Teams also have to ask: minimum access for which stage of the workflow?

This is where Hanson's argument compounds with Galow's. A standing credential gives the agent maximum scope for the maximum duration. An OAuth-scoped token with cross-app access gives the agent a narrow scope for a narrow duration, mediated by an identity provider that can see the request and revoke it. The first option lets the agent do anything the human can do, on a schedule the agent chooses. The second option lets the agent do this specific thing, for this specific task, with an audit trail that survives the interaction.

The deeper point is that identity and authority are joint design objects. Granular authority without a principal to attach it to is unenforceable. A principal without scope is just a louder user. The hard work is binding them: producing an agent-shaped identity with agent-shaped scopes that match the agent-shaped workflows the system actually runs.
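
The question "minimum access for which stage of the workflow?" can be made concrete with a small sketch. The stage names loosely follow the tax-preparation example earlier in the chapter; the scope strings, the `HUMAN_ONLY` set, and the `authorize` function are all hypothetical, chosen only to show scopes bound to a stage rather than granted once for the whole run.

```python
# Hypothetical mapping from workflow stage to the scopes that stage may use.
STAGE_SCOPES = {
    "extract": {"documents:read"},
    "map_fields": {"documents:read", "return:draft"},
    "resolve_errors": {"documents:read", "return:draft", "validation:read"},
}

# Powers that always need a human decision, whatever the stage.
# Which powers belong here is an assumption of this sketch.
HUMAN_ONLY = {"validation:override", "client_records:fetch_more"}


def authorize(stage: str, scope: str, human_approved: bool = False) -> bool:
    """Allow a scope only if the current stage grants it or a human approved it."""
    if scope in HUMAN_ONLY:
        return human_approved
    return scope in STAGE_SCOPES.get(stage, set())


print(authorize("extract", "documents:read"))          # granted to this stage
print(authorize("extract", "return:draft"))            # belongs to a later stage
print(authorize("resolve_errors", "validation:override"))                       # needs a human
print(authorize("resolve_errors", "validation:override", human_approved=True))  # human approved
```

Keeping the override power out of every stage's scope set is what stops the workflow from quietly crossing from assistance into authorization.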

Sandboxing is product infrastructure

Figure 07.4/Enterprise MCP has one shape

The same principle appears in code-executing environments, with the stakes turned up.

Fouad Matin's security guidance for coding agents at OpenAI keeps returning to four concrete controls: sandboxing, network restriction, privilege boundaries, and human review. Treat them as the default-on baseline, not the hardening you add after an incident — a capable system will sometimes be confused, manipulated, or overly eager, and each of the four bounds a different failure. Sandboxing contains a bad command, network restriction stops exfiltration and unexpected callouts, privilege boundaries cap what the agent can reach even when it is wrong, and a human review gate catches the trajectory the other three let through. Real safety has to live in the runtime and permission system, not only in the model.

Harshil Agrawal at Cloudflare makes the same case from the platform side. His talk on sandboxing AI-generated code argues that the sandbox is not an afterthought added once the model misbehaves; it is the substrate the agent runs on. If the agent is going to execute code, that execution needs to be bounded by default — network restricted, filesystem scoped, time-limited, resource-capped. The bounds are the product. Without them, the system is not running code on behalf of a user. It is granting the model an open shell on infrastructure.
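
A rough sketch of "bounded by default" for code execution follows. It is deliberately not a real sandbox and not Cloudflare's or OpenAI's implementation: it shows only a wall-clock limit, a throwaway working directory, and a stripped environment, using the Python standard library. Network restriction and true filesystem isolation need a container, VM, or OS-level mechanism that this sketch does not provide. `run_bounded` and `ALLOWED_ENV` are hypothetical names.

```python
import os
import subprocess
import sys
import tempfile

# Environment variables the child may inherit (an assumption of this sketch).
ALLOWED_ENV = ("PATH", "SYSTEMROOT", "LD_LIBRARY_PATH")


def run_bounded(code: str, timeout_s: float = 5.0) -> dict:
    """Run untrusted Python in a child process with a time limit and a scratch directory."""
    env = {k: os.environ[k] for k in ALLOWED_ENV if k in os.environ}
    with tempfile.TemporaryDirectory() as scratch:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores PYTHON* variables
                cwd=scratch,         # starting directory only; this is not a filesystem jail
                env=env,             # do not leak the parent's secrets or tokens
                capture_output=True,
                text=True,
                timeout=timeout_s,   # hard wall-clock limit; the child is killed on expiry
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "reason": "timeout", "stdout": ""}
    return {"ok": proc.returncode == 0, "reason": "exit", "stdout": proc.stdout}


print(run_bounded("print(1 + 1)"))
print(run_bounded("while True: pass", timeout_s=0.5))
```

The design choice worth noting is that the limits are parameters with safe defaults, so a caller has to opt out of a bound rather than remember to opt in.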

Apple Private Cloud Compute, profiled by Jmo at CONFSEC, sits at the high end of this spectrum. The architecture is built around the principle that even Apple cannot see what runs inside a user's session. The cryptographic boundary is structural. That model is overkill for most product surfaces, but it makes the point in the strongest possible form: when the cost of getting trust wrong is unbounded, the boundary has to be designed-in rather than enforced socially. The right amount of sandbox for an enterprise coding agent is less than Apple's. It is also not zero.

Sandbox, least privilege, and auditability belong in the same category as evals, observability, and durable runtimes: they are product infrastructure, not security overhead. A team that treats them as the security team's problem will discover the boundary by getting it wrong in production. A team that treats them as part of the runtime spec has a chance of getting them right before that.

Standardized protocols expand the attack surface

A common hope around the rise of the Model Context Protocol (MCP) is that standardization will reduce the security problem by giving everyone a known interface to attack and defend. The actual pattern in the corpus is closer to the opposite. Easier interoperability means more tools can be exposed more quickly, which makes the questions of scope, mediation, review, and audit more urgent rather than less. The practical test: if adopting MCP raises the number of capabilities your agents can reach faster than your team can answer "who can call this, with what scope, and where is it logged?", standardization has expanded your attack surface, not shrunk it.

Tun Shwe at Lenses puts the bottom line plainly: "Your insecure MCP server won't survive production." The talk is essentially a tour of the ways MCP servers ship with assumptions that survive demo conditions and fail under real load. Authentication treated as a configuration option. Tool descriptions trusted as input. Servers exposed publicly because internal routing was the harder problem. None of these are MCP-specific bugs. They are the predictable consequences of pulling a wire protocol into a category — capability exposure — that has not had the security review the wire has.

Karan Sampath at Anthropic, working on enterprise rollouts, names the structural answer from the governance side. "The really important thing for security teams ... is they need to establish a root of trust," he says. The enterprise version of this problem is not just performance, and it is not even just authentication. It is whether the security team can inspect the capability surface, reason about its risks, and produce a defensible record of what was allowed and why. A protocol that makes capability exposure faster does not lower that bar; it raises it.

David Mytton at Arcjet builds the same argument from outside-the-perimeter. His talk on defending sites from AI bots reads as a parallel chapter to the MCP discussion: as the ability to script automated interaction expands, the cost of identifying and bounding that interaction expands with it. The defender's surface is no smaller. It is larger, and the attacker now has standardized tools. It is tempting to assume the bot-defense problem and the MCP-exposure problem are separate teams' work, but both are the same question — distinguishing authorized automated callers from unauthorized ones — and standardization arms the attacker on both sides of the perimeter at once.

Protocol standardization expands the attack surface if governance lags. It is not an argument against standardization. It is an argument that standardization is a forcing function for governance work that has to happen in parallel, not afterwards.

Enterprise MCP rolls up to gateways and a root of trust

Once enterprise teams take the security problem seriously, the architecture they reach for converges on a recognizable shape — and the convergence is specific enough to use as a checklist. There is a gateway. There is a policy plane. There is a registry of blessed servers. There is a permissions model that knows about identities, tools, and approval requirements. There is an audit log that records what each agent invocation actually did. If you are standing up enterprise MCP and any one of those five is missing, that gap is the likeliest place for the boundary to fail first. The shape looks a lot like the API-gateway and IAM patterns that mature enterprise SaaS settled into a decade earlier — applied to a new category of capability, which means the teams that already own those patterns are the ones to build the agent version, not a greenfield security project.

Sampath's enterprise work names the architectural conclusion directly. The root of trust is established at the platform, not at the individual tool. Servers are reviewed before they are allowed in the corporate environment. Tools are scoped to the principals that should be able to call them. Audit is built in at the gateway layer rather than tacked onto each integration. This is the enterprise equivalent of moving from "anyone can install any app" to "we have an internal app store with security review." Unglamorous, and exactly what lets the technology be used inside a regulated enterprise at all.
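
The gateway shape described above can be sketched as a single choke point that consults a registry of reviewed servers, checks a permissions model, and writes an audit entry for every attempt. This is a toy under stated assumptions: `Gateway`, its fields, and the decision strings are hypothetical, and the policy plane and approval requirements are collapsed into one permissions dictionary for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Gateway:
    """One choke point: blessed-server registry, permissions model, audit log."""
    blessed_servers: set = field(default_factory=set)  # servers that passed review
    permissions: dict = field(default_factory=dict)    # principal -> set of "server/tool"
    audit_log: list = field(default_factory=list)      # one entry per attempted call

    def call(self, principal: str, server: str, tool: str) -> bool:
        """Decide whether the call may proceed and record the decision either way."""
        target = f"{server}/{tool}"
        if server not in self.blessed_servers:
            decision = "deny:unreviewed_server"
        elif target not in self.permissions.get(principal, set()):
            decision = "deny:not_permitted"
        else:
            decision = "allow"
        self.audit_log.append({"principal": principal, "target": target, "decision": decision})
        return decision == "allow"


gw = Gateway(blessed_servers={"github"}, permissions={"agent-1": {"github/list_issues"}})
print(gw.call("agent-1", "github", "list_issues"))  # reviewed server, permitted tool
print(gw.call("agent-1", "github", "delete_repo"))  # reviewed server, tool not permitted
print(gw.call("agent-1", "shadow", "anything"))     # server never reviewed
print(len(gw.audit_log))                            # denials are logged too
```

Logging denials as well as allowed calls is the part that makes the record defensible later: the log shows what was refused and why, not only what ran.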

Sam Morrow's GitHub work shows the same shape at production scale. As part of scaling MCP for GitHub's customers, the team filtered tools by PAT (personal access token) scopes — so an agent invoking a GitHub MCP server only sees the tools its token actually authorizes — and used step-up OAuth to request additional privileges only when needed, rather than pre-granting them. The effect is that the agent's effective capability surface is dynamically bounded by the identity it is acting under and the task it is currently performing. Capability follows authority, not the other way around.
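
Scope-based tool filtering and step-up can be illustrated with set operations. This is not GitHub's implementation, and the tool names and scope strings below are invented for the example rather than real GitHub token scopes; `visible_tools` and `step_up_needed` are hypothetical helpers.

```python
# Hypothetical tool catalogue: tool name -> scopes the tool requires.
TOOLS = {
    "list_issues": {"repo:read"},
    "create_issue": {"repo:read", "issues:write"},
    "delete_repo": {"repo:admin"},
}


def visible_tools(token_scopes: set) -> list:
    """Return only the tools whose required scopes are all held by the token."""
    return sorted(name for name, needed in TOOLS.items() if needed <= token_scopes)


def step_up_needed(tool: str, token_scopes: set) -> set:
    """Scopes to request through a step-up flow before this tool can be called."""
    return TOOLS[tool] - token_scopes


token = {"repo:read"}
print(visible_tools(token))                   # only tools the token fully covers
print(step_up_needed("create_issue", token))  # what a step-up request would ask for
print(step_up_needed("list_issues", token))   # empty set: no escalation required
```

Filtering before the model sees the tool list means an unauthorized tool is never offered as an option, which is a stronger bound than rejecting the call after the model has already planned around it.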

These two cases together make the architectural argument concrete. Enterprise MCP adoption pushes toward gateways, blessed platforms, and a root of trust. Naming the pattern matters because it changes what teams build first. Instead of "ship one MCP server and add another, and another," the platform shape becomes the first deliverable. Individual servers are then deployed against a structure that knows how to govern them.

Per-tool OAuth flows are a governance problem

A related issue surfaces in everyday agent operation, and it is one of the cases where what looks like a UX annoyance is actually a governance failure.

Most current MCP and agent integrations require the user to authorize each tool separately. The agent wants to talk to email. Consent flow. The agent wants to talk to calendar. Consent flow. The agent wants to talk to the document store. Consent flow. To the operator, this is a paper-cut sequence of OAuth dialogs. To the security team, it is far worse: it is invisible. Each consent grant happens between the user and the third-party tool, mediated by an identity layer the enterprise may or may not control. The IT organization has no central record of what authority the user just delegated, to which tools, under what scopes, with what expiry.

Galow's cross-app access work at WorkOS is one of the cleanest framings of the structural answer. The identity provider becomes the bridge between clients and servers, so that credentials can be obtained without repeated manual consent flows and so that the enterprise has a single place to see — and revoke — what was delegated. That second part is the load-bearing one. A faster consent flow with no visibility is not progress. A consent flow that produces a visible, revocable audit trail at the identity layer is.

Hanson's OAuth argument fits into the same shape. The right pattern for agent credentials is short-lived, scoped tokens issued through a flow the identity provider can audit — not standing keys, not impersonation, not human credentials borrowed for agent work. Morrow's step-up OAuth on GitHub is the same pattern at the tool layer: the agent never holds more authority than it currently needs, and any escalation goes through a flow that produces a record.

Repeated per-tool OAuth flows are not just annoying; they are a governance and IT visibility problem. The fix is structural — push the trust bridge into the identity provider — and the architectural payoff is the same as in the gateway argument: the enterprise can answer the basic auditing questions ("who delegated what, to whom, when, and for how long?") because the system was built to capture them, not because someone reconstructed them after the fact.
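
The auditing questions above ("who delegated what, to whom, when, and for how long?") map directly onto the fields of a central delegation record. The sketch below assumes an in-memory ledger standing in for the identity provider; `Delegation`, `DelegationLedger`, and the scope strings are hypothetical and do not describe the WorkOS or Keycard products.

```python
from dataclasses import dataclass


@dataclass
class Delegation:
    user: str            # who delegated
    agent: str           # to whom
    tool: str            # for which tool
    scopes: tuple        # what authority
    granted_at: float    # when, in epoch seconds
    ttl_s: float         # for how long
    revoked: bool = False

    def active(self, now: float) -> bool:
        """A delegation counts only while it is unexpired and not revoked."""
        return not self.revoked and now < self.granted_at + self.ttl_s


class DelegationLedger:
    """Central record kept at the identity layer instead of scattered consent screens."""

    def __init__(self):
        self.records = []

    def grant(self, user, agent, tool, scopes, now, ttl_s):
        rec = Delegation(user, agent, tool, tuple(scopes), now, ttl_s)
        self.records.append(rec)
        return rec

    def active_for(self, user, now):
        """Everything this user has currently delegated, across all tools."""
        return [r for r in self.records if r.user == user and r.active(now)]

    def revoke_all(self, user):
        """Revoke every delegation from one place; returns how many records changed."""
        changed = 0
        for r in self.records:
            if r.user == user and not r.revoked:
                r.revoked = True
                changed += 1
        return changed


ledger = DelegationLedger()
ledger.grant("alice", "agent-1", "email", ["mail:read"], now=0.0, ttl_s=60.0)
ledger.grant("alice", "agent-1", "calendar", ["cal:read"], now=0.0, ttl_s=600.0)
print([r.tool for r in ledger.active_for("alice", now=120.0)])  # the email grant has expired
```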

Audit as part of the trust model

The chapter has so far drawn most of its evidence from enterprise and developer-tools contexts. The corpus contains one public-sector talk that is worth surfacing because it raises the constraints to a level that makes them clearer than the enterprise versions.

Mark Myshatyn at Los Alamos National Lab presented on government agents at one of the AI Engineer events. The talk reads, in places, like the rest of this chapter taken to a higher cost function. When the user is a federal employee, the data is classified, and the action might be regulated under specific statutes, the identity, authority, and audit questions stop being best practices and become legal requirements. The agent has to act under a real principal. The scope has to be documented. The audit trail has to survive review. The boundaries have to be enforceable, not aspirational. The talk is useful here not because every enterprise looks like a national lab, but because the national lab makes the constraints visible — and most of them turn out to be the same constraints other regulated industries are starting to discover for themselves.

The observability argument from the previous chapter returns here in a stricter form. Audit trails, trajectory views, approval logs, and replayable histories are not just debugging conveniences. They are part of the trust model. If a system drafts a return, assembles a legal research memo, or performs a sensitive code change, an institution needs to be able to answer basic questions afterward: who authorized this path, what evidence was consulted, what tools were used, what warnings appeared, and where a human judgment entered or failed to enter. Without those answers the institution cannot certify the work. Without certification, the work cannot ship.
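
The questions an institution needs answered afterward suggest what a recorded trajectory has to contain. The sketch below is one possible shape, not a standard: `Step`, `Trajectory`, and the field names are hypothetical, and a production audit trail would also need tamper resistance and retention, which are out of scope here.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str
    evidence: list = field(default_factory=list)  # documents or records consulted
    warnings: list = field(default_factory=list)  # warnings raised during the step
    approved_by: str = ""                         # human approver, empty if none


@dataclass
class Trajectory:
    authorized_by: str
    steps: list = field(default_factory=list)

    def unreviewed_warnings(self):
        """Steps that raised a warning but never passed through a human."""
        return [s for s in self.steps if s.warnings and not s.approved_by]

    def summary(self):
        """Answer the after-the-fact questions from the recorded history."""
        return {
            "authorized_by": self.authorized_by,
            "tools_used": [s.tool for s in self.steps],
            "evidence": sorted({e for s in self.steps for e in s.evidence}),
            "warnings": [w for s in self.steps for w in s.warnings],
            "human_reviewed_steps": [s.tool for s in self.steps if s.approved_by],
        }


run = Trajectory(authorized_by="alice", steps=[
    Step("extract_fields", evidence=["w2.pdf"]),
    Step("validate", warnings=["field mismatch"]),
    Step("assemble_draft", warnings=["override used"], approved_by="bob"),
])
print([s.tool for s in run.unreviewed_warnings()])  # warnings no human ever saw
print(run.summary()["human_reviewed_steps"])        # where human judgment entered
```

A query like `unreviewed_warnings` is the kind of check that turns an audit trail from a debugging aid into part of the trust model: it finds the places where a human judgment failed to enter.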

This is also where the argument has to stay honest. Conference talks are especially good at surfacing strong patterns, field reports, and architecture instincts. They are weaker as proof of universal effectiveness. So the responsible claim here is not that the industry has solved trustworthy delegation. It has not. The more grounded claim is that serious teams keep discovering the same constraints: narrow scopes, explicit identities, mediated tools, sandboxes, review points, and inspectable histories.

A machine colleague is not trustworthy because it sounds careful. It is trustworthy, if at all, because its power has shape.

The next chapter takes the chapter-7 frame and stress-tests it under realtime conditions, where the cost of getting bounded authority wrong becomes audible to the user inside a single conversation. But the move there only works because the chapter behind it has named the shape of power, the substrate of identity, and the architecture of audit. Those are the pieces. The next chapter is what happens when they have to ship under a 200-millisecond clock.

What to do with this

  • Audit your agents for standing credentials. Any long-lived API key, service-account token, or "borrowed" personal access token is the agent inheriting a human's full authority with no scope to revoke and no expiry that maps to the task. Replace them with short-lived, scoped OAuth tokens bound to a session, a user, and a set of scopes — the pattern Hanson at Keycard argues is the right shape for delegation, not just a protocol choice.
  • Make the agent a real principal in your identity provider, not a passenger on a human's credential. Auth0's Riley and Galan frame this as "we authorize agents, servers" — once the agent has its own scopes, lifetime, audit footprint, and revocation path, an agent authenticated as a blurry extension of a human stops being a delegation and starts being impersonation.
  • When scoping agent permissions, ask "minimum access for which stage of the workflow?" — not just "minimum access." A scope that is harmless in isolation can compound across retrieval, reasoning, persistence, and retries, so least privilege for agents has to be bound to the workflow stage, not granted once for the whole run.
  • Make the sandbox the default substrate for any code-executing agent: network restricted, filesystem scoped, time-limited, resource-capped (Agrawal/Cloudflare), plus Matin's four controls — sandboxing, network restriction, privilege boundaries, and human review. Without those bounds you are not running code on behalf of a user; you are granting the model an open shell on your infrastructure.
  • Before standing up enterprise MCP, build the platform shape first, not one server at a time: a gateway, a policy plane, a registry of blessed servers reviewed before they enter the environment, a permissions model over identities and tools, and an audit log at the gateway layer. Sampath's rule is to establish the root of trust at the platform, not the individual tool — treat it like an internal app store with security review.
  • Filter tools by the caller's token scopes and use step-up OAuth to request extra privilege only when needed, the way Morrow's team did for GitHub's MCP server — so the agent only ever sees the tools its token authorizes and any escalation produces a record. And push per-tool consent through the identity provider as a trust bridge (Galow/WorkOS) so IT keeps one revocable, auditable record of who delegated what, to whom, when, and for how long.
EVIDENCE OF SOURCE · CHAPTER 07 · VIDEOS

13 claims · 50 source anchors

Evidence — Source Anchors

Chat is an insufficient control surface for long-running or high-stakes work

  • Chat is one-dimensional. It's a very low bandwidth interface,
    #3 — Jacob Lauritzen, Legora · confidence: high
  • we're asking AI systems to now produce output and produce judgments and decisions
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • handle state potentially over long periods of time. There needs to be human interaction for approvals
    #167 — Preeti Somal, Temporal · confidence: high

Evals are a control system, not just a test suite

  • improvement without measurement is limited and imprecise.
    #125 — Ido Pesok, Vercel v0 · confidence: high
  • We still want to build reliable scalable applications and that is still hard
    #184 — Samuel Colvin, Pydantic · confidence: high
  • eval to us it's actually the same problem from a from a systems perspective.
    #628 — Phil Hetzel, Braintrust · confidence: high
  • small CLI tool that we call eval tool
    #689 — Lawrence Jones, incident.io · confidence: high
  • designed to allow agents to leverage our eval suite files.
    #689 — Lawrence Jones, incident.io · confidence: high
  • classic benchmark maxing.
    #746 — Ara Khan, Cline · confidence: high
  • There are right ways to use them. There are wrong ways to use them.
    #746 — Ara Khan, Cline · confidence: high

Durable state and workflow semantics are trust features, not backend details

  • once we get into longer running workflows, that's where it really becomes a problem.
    #99 — Samuel Colvin, Pydantic · confidence: high
  • no one's going to trust your agent.
    #167 — Preeti Somal, Temporal · confidence: high
  • the workflow orchestration layer needs to be deterministic. So it can be rerun um in a in a uh deterministic fashion
    #44 — Peter Wielander, Vercel · confidence: high
  • where I've got some big production CI stack to go and run and deployment takes hours, being able to go and change variables in production or in staging very quickly
    #657 — Samuel Colvin, Pydantic · confidence: high
  • you'll be able to assemble agent teams that can complete tasks orders of magnitude harder than what you can complete with a single agent today.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • minding the gap around observability.
    #680 — Amy Boyd & Nitya Narasimhan, Microsoft · confidence: high

Human oversight works best as an architectural layer, not an afterthought

  • There needs to be human interaction for approvals or other reasons and of course they need to be able to be uh able to run in parallel for efficiency
    #167 — Preeti Somal, Temporal · confidence: high
  • dial these agency dials far up.
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • maintaining a factory would require you to have an overview of the processes you want your coding agents to go through.
    #629 — Eric Zakariasson, Cursor · confidence: high

High-stakes systems tune agency instead of maximizing it

  • a binary thing but as a lever that you can dial
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • agentic workflows we can plan and execute
    #201 — Yogendra Miraje, Factset · confidence: high
  • send it to me for approval.
    #202 — Rita Kozlov, Cloudflare · confidence: high
  • credentials, payments, and checkout require determinism.
    #745 — Steve Kaliski, Stripe · confidence: high

The next failure frontier is context misassembly, not just hallucination

  • there's this third thing, which I think is like really new and no one is doing it yet, which is training things into weights.
    #48 — Jack Morris · confidence: high
  • this is really useful if you're building anything related to some sort of internal deep research sort of API
    #47 — Ivan Leo, Manus AI / Meta Superintelligence · confidence: high
  • you combine it with all your other signals. So now if you look at your ranking function
    #156 — David Karam, Pi Labs · confidence: high
  • it's hybrid search because you have multiple approaches, and then you can either boost them together. You could do reranking, which is becoming more and more popular.
    #172 — Philipp Krenn, Elastic · confidence: high

Identity is a first-class engineering object for agentic systems

  • we actually persist scopes we manage lifetimes of tokens um we do a lot of handling there
    #37 — Patrick Riley & Carlos Galan, Auth0 · confidence: high
  • we go get API keys that are typically longived and broadly scoped. We paste them into some configuration files and environment variables
    #150 — Jared Hanson, Keycard / Passport.js · confidence: high
  • if you've used MCP at all extensively, you know that it means consent screens on top of consent screens on top of consent screens.
    #627 — Garrett Galow, WorkOS · confidence: high

Sandbox, least privilege, and auditability are product infrastructure, not security overhead

  • making sure that you're actually providing the correct level of sandboxing, whether it's uh containerization or it's using app level sandboxing,
    #152 — Fouad Matin, OpenAI (Codex, Agent Robustness) · confidence: high
  • We have been sandboxing untrusted code for decades. Your browser does it right now. Every tab run in its own sandbox.
    #31 — Harshil Agrawal, Cloudflare · confidence: high
  • these are what they call enforceable guarantees, not just policies.
    #149 — Jmo, CONFSEC, on Apple Private Cloud Compute · confidence: high
  • we see it as the greatest opportunity and the greatest threat to national security,
    #86 — Mark Myshatyn, Los Alamos National Lab · confidence: high
  • never compromise trust for convenience.
    #744 — Michael Hablich, Google (Chrome DevTools) · confidence: high

Protocol standardization expands the attack surface if governance lags

  • there's no halfway house because you can't do a little bit of production. You're either behind the wall or you're standing out in the open.
    #32 — Tun Shwe, Lenses · confidence: high
  • The really important thing for security teams and enterprises that want to allow this to be decentralized is they need to establish a root of trust.
    #624 — Karan Sampath, Anthropic · confidence: high
  • something like operator just shows up as a Chrome browser and it's much more challenging to understand and detect
    #148 — David Mytton, Arcjet · confidence: high

Enterprise MCP adoption converges on gateways, blessed platforms, and a root of trust

  • we think that the goal for a secure this for any security team is to is to bless one platform.
    #624 — Karan Sampath, Anthropic · confidence: high
  • challenges we've faced building and scaling our remote server, how we've overcome them,
    #625 — Sam Morrow, GitHub · confidence: high
  • if we continue this pattern for hundreds or thousands of agents, we've got a pretty big security problem on our hand.
    #150 — Jared Hanson, Keycard · confidence: high

Per-tool OAuth flows are a governance and IT visibility problem, not just a UX annoyance

  • you have this like lasting access problem that it doesn't have any visibility over
    #627 — Garrett Galow, WorkOS · confidence: high
  • We know how to transition away from static secrets uh, to dynamic access using OOTH.
    #150 — Jared Hanson, Keycard · confidence: high
  • if you log into GitHub MCP with a PAT token that we just immediately filter the tools down by the scopes that the token has.
    #625 — Sam Morrow, GitHub · confidence: high

Once an AI system can act autonomously, bounding its authority becomes the price of deployment

  • we're asking AI systems to now produce output and produce judgments and decisions
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • most primitives the magic happens when you combine these things together
    #138 — Sam Bhagwat, Mastra.ai · confidence: high

Agent commerce is a new infrastructure layer: agents transact on a human's behalf, shifting the stack from payment rails to delegated intent and verifiable authority

  • AI digitizes the participants and their interactions.
    #200 — Adam Behrens, New Generation · confidence: high
  • we go from low-level payment infrastructure to higher level intent infrastructure.
    #200 — Adam Behrens, New Generation · confidence: high
  • help agents interact with the economy starting with e-commerce
    #503 — Justin, Ionic · confidence: high
  • adapt to that new kind of buyer.
    #745 — Steve Kaliski, Stripe · confidence: high
AI QUALITY · CHAPTER 07 · MASH JUDGES
1 unsupported claim — ship-blocker

scored on version git:2f2668c

Figure 08.0/Turn-based chat vs realtime pipeline

CHAPTER 08/3,482 words/15 min read/Drafting

Chapter 8 — Realtime, Voice, and the Cost of Being Interruptible

Text chat flatters AI systems. Voice does not.

In text, users tolerate delay, awkwardness, and weak handoffs because the interface gives the system time to recover. In spoken interaction, every defect becomes audible. If the model pauses too long, it sounds confused. If a tool call stalls, it sounds incompetent. If the system interrupts badly or loses the thread after a clarification, it stops feeling like an unfinished prototype and starts feeling like a bad conversational partner.

That asymmetry is why voice deserves a chapter, not a section. Realtime interaction pressure-tests almost every claim made earlier in the book. Weak context becomes an audible loss of thread. Brittle runtime design becomes broken interruption handling. Sloppy authority boundaries become dangerous casual approvals. Slow tools become conversational incompetence. Voice does not introduce a new thesis; it makes the existing one harder to hide from.

A stress test the model alone cannot pass

Figure 08.1/Latency budget

The natural assumption among engineers stepping into voice for the first time is that the model is the bottleneck. If the speech-to-speech system feels stiff, the obvious culprit is the LLM. Make it smarter, make it faster, and the experience converges to a conversation.

The practitioners building production voice systems do not see it that way anymore. Sean DuBois at OpenAI and Kwindla Hultman Kramer at Daily titled their joint talk "Your realtime AI is ngmi" for a reason: the gap between current voice agents and natural conversation, in their experience, is dominated by problems no model improvement will close. Tool-call variance. Overlap handling. Interruption recovery. Turn budgets. None of those are model-quality problems. They are systems problems with an architectural shape.

Neil Zeghidour at Gradium AI makes the strongest version of this argument. "The main bottleneck is becoming the tool call," he says, after walking through what each layer of a voice agent now costs. Speech recognition is fast. The model itself is fast. Speech synthesis is fast. What is not fast — and worse, what is not predictable — is the call out to a function or an external service that the agent has to make in the middle of a sentence. "You have a tool call or open router that is going to have a latency between 500 milliseconds and 4 seconds," he notes. The variance is more painful than the average. A 4-second pause in conversation is not a slow response. It is a dead line.

Dippu Singh frames the same problem from inside the contact-center stack. "Processing real-time voice data is an engineering minefield of latency, accents, and interruptions," he writes, and minefield is the precise word — the failures do not announce themselves on the way in. They cascade. A bad turn cue causes the next turn to misroute, which causes the retrieval to fetch the wrong record, which causes the response to be confidently wrong. By the time anyone notices, the call is over.

The chapter that follows treats voice the way the rest of the book has treated coding, evaluation, context, and runtime: as a problem of the loop around the model, not of the model itself.

Latency is a budget across the whole stack

Figure 08.2/Speech layer wrapper

Zeghidour's working number for natural conversation is precise. "The entire stack of understanding, producing an answer, and pronouncing it" needs to land "around 200 milliseconds." That is the budget. Not the recognition latency. Not the inference latency. The end-to-end turn budget — from the user finishing their sentence to the user hearing the system's reply.

Two hundred milliseconds is approximately the length of a comfortable conversational gap in English. Use it as a pass/fail line, not an aspiration: measure end-to-end turn latency from end-of-utterance to first audio packet, and treat anything past ~200ms as a hesitation the user will register consciously or not. Two beats and the system feels slow; three and it feels confused. The trap is optimizing the wrong number — shaving inference milliseconds while the turn budget blows out on a tool call.

Inside that budget, the system has to do at least: detect end-of-utterance, transcribe the audio, build the context, run inference, possibly call one or more tools, generate the response text, synthesize it as audio, and start streaming it. The components that have been heavily optimized — speech recognition, model inference, text-to-speech — are no longer the dominant cost. Mark Backman's Daily workshop on realtime voice walks through this in production-grade detail: the engineering wins of the last two years have come almost entirely from squeezing variance out of orchestration, not from making any single component faster on the median.
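The budget arithmetic above can be sketched in a few lines. This is a minimal illustration rather than a measurement tool: the stage names and millisecond figures are made-up placeholders, and `StageTiming` and `turn_report` are hypothetical names. Only the 200 ms target comes from the text.

```python
from dataclasses import dataclass

TURN_BUDGET_MS = 200  # the end-to-end target quoted in the text


@dataclass
class StageTiming:
    name: str   # hypothetical stage label
    ms: float   # measured latency of this stage for one turn


def turn_report(stages, budget_ms=TURN_BUDGET_MS):
    """Sum per-stage latency for one turn and name the largest contributor."""
    total = sum(s.ms for s in stages)
    dominant = max(stages, key=lambda s: s.ms)
    return {
        "total_ms": total,
        "over_budget": total > budget_ms,
        "dominant_stage": dominant.name,
    }


# Illustrative numbers only: every component is fast except the tool call.
turn = [
    StageTiming("end_of_utterance", 30),
    StageTiming("transcribe", 40),
    StageTiming("inference", 60),
    StageTiming("tool_call", 900),
    StageTiming("synthesize_first_packet", 50),
]
report = turn_report(turn)
```

With these placeholder numbers the report names the tool call as the dominant stage, which is the situation where shaving inference milliseconds would not rescue the turn.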

Kwindla Hultman Kramer's Pipecat Cloud work makes the same point as infrastructure rather than commentary. The whole architecture is organized around the budget. Audio frames arrive on a deterministic clock. Each stage of the pipeline has a deadline. The system is designed assuming that any individual call might be late, and the orchestration handles the lateness — not the component that was late.
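The idea that the orchestration, not the late component, handles lateness can be sketched with a per-stage deadline. This is an assumption-laden illustration using Python's standard `asyncio`, not Pipecat's API; `run_with_deadline`, `slow_lookup`, and `fast_lookup` are hypothetical names.

```python
import asyncio


async def run_with_deadline(stage, deadline_s, on_late):
    """Await one pipeline stage; if it misses its deadline, return on_late."""
    try:
        return await asyncio.wait_for(stage, timeout=deadline_s)
    except asyncio.TimeoutError:
        # wait_for cancels the late stage; the orchestrator, not the stage,
        # decides what the conversation does next.
        return on_late


async def slow_lookup():
    await asyncio.sleep(0.2)  # stands in for a late tool call
    return "lookup result"


async def fast_lookup():
    return "lookup result"


late = asyncio.run(run_with_deadline(slow_lookup(), 0.05, "still checking"))
on_time = asyncio.run(run_with_deadline(fast_lookup(), 0.05, "still checking"))
```

The design choice worth noticing is that the fallback lives in the caller: the stage itself never needs to know it has a deadline.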

This is the inversion the chapter rests on. In a text agent, latency is a property you optimize. In a voice agent, latency is the substrate the system has to be built on. Once you accept that, every other architectural choice — wrap-don't-rebuild, full-duplex vs half-duplex, fillers, parallel tool calls — falls out as a consequence.

Wrap, don't rebuild

Figure 08.3/Interruption runtime

A common mistake in voice projects is to assume that the existing chat agent has to be redesigned for voice. The instinct is reasonable. Voice feels different enough from text that surely the architecture has to change.

Luke Harries at ElevenLabs makes the case for the opposite move with one of the more honest sentences in the corpus. "I've already got my agent. I spent loads of time doing the evals," he says, framing the eval infrastructure not as something the voice migration can throw away but as the asset that justifies keeping the existing architecture. The right move, in his framing, is to "wrap it up into its own first-class primitive" — to add a voice engine to the agent rather than rebuild it as a voice agent.

The reasoning is sharper than it first appears. The hard parts of a production agent — the tools, the evals, the durable state, the permission model, the retry logic, the observability — are not voice-specific. They are agent-specific. Throwing them away to start from a speech-native stack means throwing away the work that made the system trustworthy in the first place. As Harries notes, "your chat agent actually normally does the majority of tool calling." The voice layer's job is to make that agent audible. The agent's job remains the same.

Thor Schaeff's ElevenLabs workshop on building conversational AI agents is essentially an operational version of the same argument. The starting point is the agent you have. The voice layer adds turn detection, barge-in handling, partial audio streaming, and a speech model — but it does not replace the orchestration that already worked.

This pattern shows up across the cluster. Brooke Hopkins at Coval frames her voice work as "from self-driving to autonomous voice agents," and the analogy is not casual: she treats the voice runtime the way an autonomy stack treats the perception and planning layers — as composable subsystems with their own contracts, not as a monolithic rewrite of the underlying control system. The operational test: if adding voice forces you to touch the agent's tools, evals, or state model, you are rebuilding when you should be wrapping. Draw the seam at the interface, give the voice layer (turn detection, barge-in, audio streaming, speech model) its own contract, and leave the orchestration untouched. The agent persists across the migration. The interface does not.
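One way to picture the seam is a wrapper that depends only on the agent's existing text contract. The sketch below is an illustration under stated assumptions, not ElevenLabs' or Coval's design: `TextAgent`, `VoiceLayer`, and `ExistingAgent` are hypothetical names, and the speech functions are fakes so the example runs without audio libraries.

```python
from typing import Callable, Protocol


class TextAgent(Protocol):
    """The existing agent's contract; nothing here is voice-specific."""

    def respond(self, text: str) -> str: ...


class VoiceLayer:
    """Wraps an unchanged text agent with speech in and speech out."""

    def __init__(self, agent: TextAgent,
                 transcribe: Callable[[bytes], str],
                 synthesize: Callable[[str], bytes]) -> None:
        self._agent = agent
        self._transcribe = transcribe
        self._synthesize = synthesize

    def handle_turn(self, audio_in: bytes) -> bytes:
        text = self._transcribe(audio_in)   # voice layer's job
        reply = self._agent.respond(text)   # agent's job, untouched
        return self._synthesize(reply)      # voice layer's job


class ExistingAgent:
    """Stand-in for the agent that was already evaluated."""

    def respond(self, text: str) -> str:
        return f"You said: {text}"


# Fake speech functions: bytes in, bytes out, no real audio involved.
layer = VoiceLayer(ExistingAgent(),
                   transcribe=lambda audio: audio.decode("utf-8"),
                   synthesize=lambda text: text.encode("utf-8"))
audio_out = layer.handle_turn(b"where is my order")
```

Nothing in `ExistingAgent` changed to make this work, which is the wrapping test from the paragraph above expressed as code.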

The deeper claim is that the architectural move the book has been advocating since Chapter 3 — keep the orchestration durable, change the surfaces around it — is exactly the move that pays off when voice arrives. A team that built a strong text agent has already done most of the voice work. A team that built a thin chat wrapper has to start over.

Half-duplex is the silent architectural ceiling

Figure 08.4/Half-duplex is the ceiling

Once the engineering of latency and orchestration is in place, the next ceiling is conversational, not technical.

Most production speech-to-speech systems are half-duplex. They can listen, or they can speak, but not both at the same time. As Zeghidour puts it directly: "The model is either listening or it's speaking." That single architectural choice has consequences for every interaction that does not happen one turn at a time, which is to say, every interaction that resembles real conversation.

Real conversation has overlap. The listener says mhm while the speaker keeps going. The speaker pauses and the listener fills the pause with a question. Both parties speak briefly at the same time when the listener wants to take the turn. Zeghidour names these patterns explicitly. "Overlap between uh people speaking on one another." "Back-channeling." Both are first-class features of human dialogue, and both are unavailable to a model that can only do one of listening or speaking at any given moment.

Tom Shapland's LiveKit talk "Why ChatGPT Keeps Interrupting You" is essentially a case study in what half-duplex feels like at scale. The system has no way to lean in without taking the turn. It has no way to wait without going silent. Every conversational micro-decision collapses into the same binary: am I listening, or am I speaking?

The instinct of many teams is to patch this in product. Add a wait button. Add a tone-detection heuristic. Add a longer silence threshold. The trap: each patch helps marginally and is paid for in worse responsiveness somewhere else — a longer silence threshold buys cleaner turn-taking at the cost of a slower system. Treat these as symptoms, not fixes. When you find yourself stacking silence-timing tweaks, the right read is that the half-duplex constraint is propagating, and the lever is the architecture, not the timing.
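The silence-threshold trade can be made concrete with a toy model. It assumes a fixed-silence endpointer that declares end-of-turn once silence lasts as long as the threshold; the pause values are illustrative and `endpointing_tradeoff` is a hypothetical name.

```python
def endpointing_tradeoff(mid_turn_pauses_ms, silence_threshold_ms):
    """Score a fixed silence threshold against pauses a speaker made mid-turn.

    A pause at least as long as the threshold is mistaken for end-of-turn,
    so the system cuts in. Every real end-of-turn also waits out the full
    threshold before the system may answer.
    """
    false_cut_ins = sum(1 for p in mid_turn_pauses_ms
                        if p >= silence_threshold_ms)
    return {"false_cut_ins": false_cut_ins,
            "added_latency_ms": silence_threshold_ms}


pauses = [150, 400, 700]  # illustrative mid-turn hesitations, in ms
short = endpointing_tradeoff(pauses, 300)
long = endpointing_tradeoff(pauses, 800)
```

In this toy model the short threshold interrupts twice and the long one never interrupts but adds 800 ms to every reply. No threshold value removes both costs.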

The cleanest long-term answer is full-duplex architecture — systems that can simultaneously listen and speak. Zeghidour's own Moshi work is one of the few production-scale demonstrations. As he is the first to point out, full-duplex models pay for the architectural change at the agent layer: the model can do conversational overlap, but it is materially weaker at the actual reasoning. The trade is real. The half-duplex world is a workplace; the full-duplex world is a research frontier.

What changes in either case is the framing. Turn-taking is not a UX polish that can be papered over with timing tweaks. It is an architectural feature of the system that determines what kinds of conversation are possible at all.

Latency must be masked, not just minimized

Figure 08.5/Mask latency, don't just minimize it

The 200-millisecond budget is a useful target. It is also, for tool-call-heavy agents, often impossible. When the tool latency variance is 500 milliseconds to 4 seconds, no amount of optimization gets the worst case into the conversational window. Optimization has a floor. Conversation does not.

The systems that ship anyway do so by adding a layer the book has not yet named: latency masking.

The basic idea is that the system, while it waits for a slow operation, keeps the conversation alive. Zeghidour describes the pattern as the agent producing fillers and partial answers: "While it waits for getting the result back, it can keep the conversation going." A short acknowledgement here. A clarifying question there. A "one moment" that buys 1.5 seconds of tool time without leaving the user staring into silence.

This is not animation. It is conversational scaffolding. The fillers are doing structural work — they preserve the user's sense of turn, they signal that the system is alive, and they create slack the orchestration needs to handle variance. The decision rule: any operation whose latency variance can exceed the turn budget needs a masking strategy before it ships — a short acknowledgement or clarifying question emitted the instant the slow call starts, not after it stalls. Done well, a 3-second tool call disappears into the rhythm of the exchange. Done poorly, the same call sounds like a freeze.
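The decision rule can be sketched as a small coroutine that schedules the slow call and speaks the filler right away rather than after a stall. This is a minimal illustration with standard `asyncio`; `call_with_masking`, `speak`, and `slow_tool` are hypothetical names, and the sleep stands in for a multi-second tool call.

```python
import asyncio


async def call_with_masking(slow_call, speak, filler="One moment."):
    """Schedule the slow call, speak a filler immediately, then await the result."""
    task = asyncio.create_task(slow_call())  # scheduled to run concurrently
    await speak(filler)                      # emitted at call start, not after a stall
    return await task


spoken = []


async def speak(text):
    spoken.append(text)     # stand-in for streaming audio to the user
    await asyncio.sleep(0)  # yield, as real audio I/O would


async def slow_tool():
    await asyncio.sleep(0.05)  # stands in for a slow tool call
    spoken.append("<tool finished>")
    return "result"


answer = asyncio.run(call_with_masking(slow_tool, speak))
```

The filler is recorded before the tool finishes, so the user hears something while the call is still in flight.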

Harries makes the same point from the architecture side. The voice layer needs the ability to interleave low-latency speech with the agent's longer-running operations, which is why he treats the voice engine as a first-class primitive with its own state — not as a streaming pipe attached to the agent's output. When the agent is busy, the voice layer is not idle. It is doing the conversational equivalent of nodding.

Suman Debnath's VoiceVision RAG work at AWS surfaces the same pattern at a different layer. Visual document intelligence integrated with voice response has to maintain the conversational thread while a vision model takes its turn. The user does not care that a document was being read in the background. They care that the conversation did not stall.

The reason this matters for the book's broader thesis is that latency masking is one of the cleanest examples of the scaffolding-makes-reliability argument. The model can be excellent. The tools can be excellent. The orchestration can be excellent. If the system has no way to keep the conversation alive while the orchestration runs, the experience falls apart. The masking layer is reliability infrastructure. It needs to be designed in, not bolted on at the end.

Latency masking belongs in the same architectural category as evals, harnesses, and durable runtimes. It is the runtime cost of being interruptible.

TTS architecture is converging on LLM architecture

The voice cluster also produces one of the cleanest pieces of evidence the book has for its broader claim that scaffolding ideas generalize across substrates.

Samuel Humeau at Mistral, whose work on TTS-as-LLM is the chapter's reference for this thread, puts the convergence plainly. "Pretty much everybody is using an auto-regressive decoder backbone," he says, describing the architectural shape the leading text-to-speech systems have settled into. The same tokenize-and-stream-and-generate pattern that drove text agents now drives speech generation. The same first-packet-latency obsession that shaped LLM serving now shapes TTS serving. The substrates are different. The discipline is the same.

If Humeau's read is right, the convergence is not cosmetic. The TTS-as-LLM framing means that the things the book has spent ten chapters arguing about — streaming, conditioning, latency budgets, perceived-latency optimization — are now also the right lens for speech-generation systems. As Humeau notes, "the latency is key here" and "the perceived latency is lower" when the system streams the first audio packet before the full response has been generated. That is exactly the streaming-and-early-emit pattern the chat side has been refining for years.
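The early-emit pattern is easy to see with a generator: audio for the first text chunk is available before later chunks have been synthesized at all. This sketch is an assumption for illustration, not Mistral's implementation; `stream_audio` and `fake_synthesize` are hypothetical names, and the fake synthesizer only encodes text to bytes.

```python
from typing import Callable, Iterable, Iterator


def stream_audio(text_chunks: Iterable[str],
                 synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield audio chunk by chunk instead of waiting for the full response."""
    for chunk in text_chunks:
        yield synthesize(chunk)


calls = []


def fake_synthesize(text: str) -> bytes:
    calls.append(text)  # record how much synthesis work has happened so far
    return text.encode("utf-8")


stream = stream_audio(["Sure.", "Your order shipped yesterday."], fake_synthesize)
first_packet = next(stream)     # playback could start here
calls_at_first_packet = list(calls)
remaining = list(stream)        # the rest is synthesized while audio plays
```

Perceived latency is the cost of the first chunk only, because nothing after it has been synthesized by the time playback can begin.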

Humeau also identifies the use case driving the convergence. "The king use case for text-to-speech is its usage within agents," he says — meaning the architectural choices in the TTS layer are increasingly being optimized for embedding in a larger agentic system, not for standalone audio production. The TTS layer is being shaped by the same pressures the agent layer is shaped by.

Humeau is admirably specific about where his own released model departs from the pattern: it uses diffusion or flow-matching for the per-frame stage rather than full autoregression. The convergence is about the backbone, not full-stack. That distinction is worth keeping; it is a reason to treat the convergence as a strong signal rather than a settled fact. But the direction is unambiguous, and on Humeau's account it shows up beyond Mistral: the same architectural family — autoregressive decoder backbones streaming token-by-token — appears across other leading voice work.

The implication for the book's broader frame is that the scaffolding-first thesis is not a text-agent argument. It is an architecture argument that happens to be especially clear at the agent layer. The voice cluster confirms it from a different substrate.

Robotics: the same constraints at higher stakes

A short note on robotics, which the chapter title gestured at.

The robotics talks in the cluster — Annika and Aastha on GR00T N1 humanoid foundation models, Jyh-Jing Hwang on Waymo's EMMA, Quan Vuong and Jost Tobias Springenberg on Physical Intelligence's "robotics: why now?" framing — read as voice agents with the latency budget tightened and the failure mode physicalized. The bottleneck shifts from tool calls to perception-to-action loops. The half-duplex ceiling shifts from turn-taking to one-action-at-a-time. The masking layer shifts from fillers to motion-plan continuity. The architectural shape is the same.

The book's claim is not that voice and robotics are the same problem. It is that they are stress-tests of the same underlying frame. When the system has to operate inside a latency budget set by physics rather than by user preference, the scaffolding-first argument is no longer a stylistic choice. It is what determines whether the system can act at all.

That makes voice the right first embodied edge to give a chapter — the engineering is mature enough to teach. Robotics is the right next one.

What voice confirms about the book's frame

The chapter started with an asymmetry: text hides what voice exposes. Everything since has been a tour of what the text channel was quietly letting the system skip.

Every section above named a problem voice exposes that text mostly hides. Coordination, not model quality. End-to-end latency budgets, not single-component speed. Architectural reuse, not voice-native rewrites. Half-duplex ceilings, not interface choices. Masking as infrastructure, not animation. TTS-as-LLM, not isolated speech modeling. In each case the chapter's claim is that voice did not introduce a new failure mode — it made an existing one audible.

That is also the chapter's connection back to the rest of the book. Chapter 3 argued that the harness around the model determines what a coding agent can be trusted to do. Chapter 4 argued that evals are the control system that holds the harness honest. Chapter 5 argued that context is the substrate the model reasons over. Chapter 6 argued that runtimes are what carry state across the gaps in a single agent run. Each of those sharpens under realtime pressure. Each becomes audible. Voice is the substrate where the book's frame stops being inferable and starts being heard.

The next chapter widens out from this stress test to the organizational question the book has been pointing at since the opening. If reliable AI is a property of the loop around the model — and if the loop now includes context architecture, durable runtimes, and realtime infrastructure — then the team that ships dependable AI looks different from the team that ships a chatbot. Chapter 9 is about what that team looks like.

The last word in the voice cluster belongs to Zeghidour, who has been honest enough throughout his talk to keep saying that there is more to do. "The last mile is going to be the most difficult to solve," he says, near the end. The chapter would not put it differently. The first mile of voice was a model problem. The last mile is an infrastructure problem. The first one is mostly solved. The second one is the work.

What to do with this

  • Measure the end-to-end turn budget, not component speed. Instrument latency from end-of-utterance to first audio packet and treat ~200ms as the pass/fail line; if you're shaving inference milliseconds while a tool call blows the budget, you're optimizing the wrong number.
  • Wrap your existing agent; don't rebuild it as voice-native. Add the voice layer (turn detection, barge-in, partial audio streaming, speech model) as a first-class primitive over the agent you already evaluated — and use this test: if adding voice forces you to touch the agent's tools, evals, or state model, you've crossed from wrapping into rebuilding.
  • Profile your tool-call latency variance, not its average. A call that averages fine but spikes to 4 seconds is the real failure; Zeghidour's "500 milliseconds and 4 seconds" range is what kills conversations, because a 4-second pause is a dead line, not a slow response.
  • Build a latency-masking layer for any operation whose variance can exceed the turn budget. Emit a short acknowledgement or clarifying question the instant the slow call starts — not after it stalls — so a 3-second tool call disappears into the rhythm instead of sounding like a freeze. Design it in as reliability infrastructure, not as a bolted-on afterthought.
  • Recognize half-duplex as an architectural ceiling, not a UX bug. When you find yourself stacking silence-timing tweaks and wait buttons, stop — each patch is paid for in worse responsiveness elsewhere; the lever is the architecture, and full-duplex buys conversational overlap at the cost of weaker reasoning, so make the trade deliberately.
  • Treat TTS as an LLM-shaped system. Reuse the streaming, first-packet-latency, and early-emit discipline from the chat side — stream the first audio packet before the full response is generated so perceived latency drops — because leading TTS work has converged on autoregressive decoder backbones optimized for embedding in agents.
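The variance-profiling and masking items can be combined into one check. The sketch below is illustrative only: the sample latencies are made up to mirror the 500 ms to 4 s range quoted above, and `percentile` and `needs_masking` are hypothetical helper names using a simple nearest-rank percentile.

```python
def percentile(samples_ms, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples_ms)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling division
    return ordered[rank - 1]


def needs_masking(samples_ms, turn_budget_ms=200, pct=95):
    """Flag a tool whose tail latency can exceed the turn budget."""
    return percentile(samples_ms, pct) > turn_budget_ms


# Made-up samples: mostly 500 ms, with occasional 4-second spikes.
tool_samples = [500] * 18 + [4000] * 2
mean_ms = sum(tool_samples) / len(tool_samples)
p50_ms = percentile(tool_samples, 50)
p95_ms = percentile(tool_samples, 95)
```

Here the median looks tolerable while the 95th percentile is the 4-second dead line, which is why the check keys on the tail rather than the average.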
EVIDENCE OF SOURCE · CHAPTER 08 · VIDEOS

7 claims · 28 source anchors

Evidence — Source Anchors

Durable state and workflow semantics are trust features, not backend details

  • once we get into longer running workflows, that's where it really becomes a problem.
    #99 — Samuel Colvin, Pydantic · confidence: high
  • no one's going to trust your agent.
    #167 — Preeti Somal, Temporal · confidence: high
  • the workflow orchestration layer needs to be deterministic. So it can be rerun um in a in a uh deterministic fashion
    #44 — Peter Wielander, Vercel · confidence: high
  • where I've got some big production CI stack to go and run and deployment takes hours, being able to go and change variables in production or in staging very quickly
    #657 — Samuel Colvin, Pydantic · confidence: high
  • you'll be able to assemble agent teams that can complete tasks orders of magnitude harder than what you can complete with a single agent today.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • minding the gap around observability.
    #680 — Amy Boyd & Nitya Narasimhan, Microsoft · confidence: high

High-stakes systems tune agency instead of maximizing it

  • a binary thing but as a lever that you can dial
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • agentic workflows we can plan and execute
    #201 — Yogendra Miraje, Factset · confidence: high
  • send it to me for approval.
    #202 — Rita Kozlov, Cloudflare · confidence: high
  • credentials, payments, and checkout require determinism.
    #745 — Steve Kaliski, Stripe · confidence: high

Realtime AI quality is primarily a coordination and latency-engineering problem, not a model-quality problem

  • the main bottleneck is becoming the tool call,
    #662 — Neil Zeghidour, Gradium AI · confidence: high
  • the entire stack of understanding, producing an answer, and pronouncing it to be around 200 milliseconds.
    #662 — Neil Zeghidour, Gradium AI · confidence: high
  • you have a tool call or open router that is going to have a latency between 500 milliseconds and 4 seconds.
    #662 — Neil Zeghidour, Gradium AI · confidence: high
  • wrapped it up into its own first class primitive,
    #661 — Luke Harries, ElevenLabs · confidence: high
  • the latency is key here
    #663 — Samuel Humeau, Mistral · confidence: high
  • knowing who said what is as important as what was said
    #742 — Hervé Bredin, pyannote · confidence: high

Voice is best added as a realtime wrapper around a chat agent, not as a rebuild

  • I've already got my agent. I spent loads of time doing the evals,
    #661 — Luke Harries, ElevenLabs · confidence: high
  • your chat agent actually normally does the majority of tool calling.
    #661 — Luke Harries, ElevenLabs · confidence: high
  • we can go very very far by just using speech as an interface.
    #663 — Samuel Humeau, Mistral · confidence: high

Half-duplex is the silent architectural ceiling on natural voice conversation

  • the model is either listening or it's speaking.
    #662 — Neil Zeghidour, Gradium AI · confidence: high
  • overlap between uh people speaking on one another.
    #662 — Neil Zeghidour, Gradium AI · confidence: high

TTS architecture is converging on LLM architecture

  • pretty much uh everybody is using an auto reggressive decoder backbone
    #663 — Samuel Humeau, Mistral · confidence: high
  • the king use case for text to speech is uh its usage within agents
    #663 — Samuel Humeau, Mistral · confidence: high
  • the intelligence is baked directly into the model.
    #755 — Thor Schaeff, Google DeepMind · confidence: high

Latency masking belongs in the same architectural category as evals, harnesses, and durable runtimes

  • while it waits for getting the result back, it can keep the conversation going in a natural way,
    #662 — Neil Zeghidour, Gradium AI · confidence: high
  • wrapped it up into its own first class primitive,
    #661 — Luke Harries, ElevenLabs · confidence: high
  • I'm going to share one of the latest research paper around retrieval which is a uh vision based retrieval and also uh I just thought to wrap this around with an agent.
    #85 — Suman Debnath, AWS · confidence: high
  • real time is different from non-real time. And by non-real time, I mean everything that's not conversational latency of a few hundred milliseconds or less.
    #142 — Kwindla Hultman Kramer, Daily · confidence: high
AI QUALITY · CHAPTER 08 · MASH JUDGES
1 unsupported claim — ship-blocker

scored on version git:2f2668c

AI Engineer Book · Ch 09

FIG. 09.0 · OPENER: Seat licenses vs operating-model redesign

CH. 09 // Drafting · 5,048 words · 22 min read

Chapter 9 — The AI-Native Organization

Most AI adoption stories are too small to be the real story. A few engineers get faster. Support summarizes tickets more quickly. A product manager drafts specs in half the time. Those are real gains, and they are also, every one of them, individual gains. Add them up and you have a company that uses AI. You do not yet have an AI-native organization.

The difference shows up on a Monday morning. Picture a company that has already moved past casual adoption. Over the weekend, agents opened pull requests. Product generated three new onboarding flows. Support drafted fixes for a backlog of tickets. Internal automations touched the billing system, the CRM, and the docs. Everyone — and everything — was productive. And the organization walks in on Monday to discover that its problem is no longer a shortage of output. It is a shortage of coherence.

That is the shift this chapter is about. AI does not simply make an organization faster. It moves where the scarce thing lives. When execution gets cheap, the bottleneck stops being production and becomes judgment, prioritization, review, and the design of throughput itself. The companies that win the next decade will not be the ones that bought the most seats. They will be the ones that redesigned the operating model so that cheap generation turns into trusted work instead of expensive noise.

AI-native is an operating model, not a purchase

Figure 09.1/Where scarcity moves

The cleanest way to see the gap between "uses AI" and "AI-native" is to look at what happens at the threshold of full adoption. Dan Shipper, building the AI-native company Every, puts the discontinuity in numbers: "There is a 10x difference between an organization where 90% of engineers use AI versus one where 100% do." The 10x is a deliberate provocation, not a measured figure — the steepness is the point, and the more sober large-scale evidence comes a few paragraphs on. The last ten percent is not a rounding error. It is the difference between AI as a personal productivity tool and AI as a medium the whole organization works in.

The reason is that partial adoption keeps the old workflows intact. If nine in ten engineers use agents but the tenth does not, every process still has to accommodate the human-only path — the review that assumes a person wrote the code, the handoff that assumes a person is on the other end, the planning that assumes work moves at human speed. The workflow cannot be rebuilt around delegation until delegation is universal, which makes the holdout the constraint: the gain from the other ninety percent is capped at faster people rather than a faster company. The lever, then, is not buying the ninety-first seat but closing the last human-only path. At ninety percent, you have faster people inside an unchanged company. At a hundred, the company itself can change shape.

That reframing — from procurement to operating-model redesign — is what separates the field reports that sound transformational from the ones that sound like tool reviews. Barr Yaron's 2025 AI Engineering Report, surveying the state of practice across the industry, reads less like a list of which tools won and more like a map of which organizational habits are forming. The teams in the "from hype to habit" cohort — the ones building an AI-first company while still shipping a roadmap — describe the same arc: the early win is individual speed, and the durable win only arrives when the surrounding work is rebuilt to assume that speed.

This is why "we rolled out AI" and "we became AI-native" are different sentences with different price tags. The first is a budget line. The second is a redesign of how work is created, reviewed, and trusted — and it is the redesign, not the rollout, that produces the order-of-magnitude difference Shipper is pointing at.

Cheaper execution moves scarcity up the stack

Figure 09.2/Alignment debt

The book has argued since its opening chapters that when code gets cheap, judgment gets expensive. Chapter 1 made it a claim about the shift from suggestion to delegation; Chapter 2 made it a claim about taste. At the organizational scale, it becomes a claim about where the bottleneck physically sits.

When a single engineer can direct several agents at once, the constraint stops being how fast the team can produce and starts being how fast the team can decide what is worth producing and confirm that what got produced is correct. Justin Reock, working on engineering leadership at DX, frames the leadership job in exactly these terms: the manager's role shifts from allocating production capacity — which is suddenly abundant — to allocating judgment and attention, which stay scarce. The practical test for a manager is to ask which scarce resource each ritual rations. A standup that reports how much got produced is rationing the abundant thing; a standup that surfaces which decisions are unmade and which output is waiting on review is rationing the scarce one. The org chart was built to ration the wrong resource — and the wrong choice is to keep optimizing throughput when the queue that is actually backing up is judgment.

Here the book has to stay honest, because this is precisely the place where AI hype outruns the evidence. The large-scale studies are more sober than the demos. Yegor Denisov-Blanch's Stanford research, drawn from data on a hundred thousand–plus developers, finds that AI's productivity effect is real but uneven — strong in some task types and codebases, near zero or negative in others, and reliably overstated when teams measure activity instead of outcomes. Some AI-generated work creates rework that quietly eats the gain. The responsible reading is not "AI makes everyone faster." It is "AI changes the distribution of where time goes, and a naive measure will misread motion as progress."

That nuance matters for org design because the wrong metric, applied to cheap execution, actively destroys value. Nick Arcolano's analysis at Jellyfish, built on a dataset of some twenty million pull requests, shows the failure mode at scale: output volume rises, the activity dashboards light up green, and the actual constraint — whether the organization can review, integrate, and trust all that output — goes unmeasured until it breaks. The failure to name is the green dashboard: a count of PRs opened, commits, or lines generated is an activity metric, and in a world where artifacts are cheap, an activity metric measures motion, not progress. The fix is to instrument the outcome instead — rework rate, the share of generated work that ships unreverted, time spent in the review queue — so the metric tracks the resource that actually moved. The scarce resource moved; the measurement did not.
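The difference between the activity count and the outcome metrics named above can be shown on a handful of records. This is a sketch under stated assumptions: the `PullRequest` fields and the four sample records are hypothetical, not Jellyfish's schema or data, and the helper assumes at least one merged PR.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    merged: bool
    reverted: bool        # merged, then rolled back
    needed_rework: bool   # merged, then needed a follow-up fix
    review_hours: float   # time spent waiting in the review queue


def activity_metric(prs):
    """What the green dashboard counts: PRs opened."""
    return len(prs)


def outcome_metrics(prs):
    """Outcome view; assumes prs contains at least one merged PR."""
    merged = [p for p in prs if p.merged]
    kept = [p for p in merged if not p.reverted]
    return {
        "shipped_unreverted_share": len(kept) / len(prs),
        "rework_rate": sum(p.needed_rework for p in merged) / len(merged),
        "mean_review_hours": sum(p.review_hours for p in prs) / len(prs),
    }


# Made-up records for illustration only.
prs = [
    PullRequest(merged=True, reverted=False, needed_rework=False, review_hours=2.0),
    PullRequest(merged=True, reverted=True, needed_rework=False, review_hours=6.0),
    PullRequest(merged=True, reverted=False, needed_rework=True, review_hours=4.0),
    PullRequest(merged=False, reverted=False, needed_rework=False, review_hours=12.0),
]
opened = activity_metric(prs)
outcomes = outcome_metrics(prs)
```

The activity count reports four PRs opened; the outcome view reports that only half of them shipped and stayed shipped.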

The five levels of AI-native maturity

Figure 09.3 / Broaden who creates

The chapter so far has been arguing that AI-native is a destination, not a toggle. But organizations ask a more practical question: where are we now, and what does progress actually look like? There are now enough field reports, engineering team case studies, and post-mortems from large-scale enterprise deployments to answer that question more precisely than "early versus late."

The pattern that emerges across the corpus is a five-level progression defined not by tooling choices but by what the organization has internalized. Each level has a characteristic bottleneck — the constraint that has to break before the next level becomes accessible.

L0: Prompt-era. AI is a personal tool used by some engineers for some tasks. There are no shared policies, no conventions, no measurement. The bottleneck is access and awareness: not everyone has the tools, and no one has thought about what to do with them collectively.

L1: Assisted. Most engineers have access and use AI routinely for code generation, summarization, and local productivity tasks. But the work product — the PR, the spec, the ticket — enters the same review process it always did. AI makes people faster inside an unchanged workflow. The bottleneck is the detection gap: the organization cannot tell which outputs were AI-assisted, which means it cannot calibrate review accordingly. Activity metrics rise; outcome metrics are unmeasured.

L2: Augmented. The organization has started to treat AI as a participant in the workflow rather than a personal accelerant. There are shared conventions, early evals, or some agreement about what AI-assisted work looks like before it enters review. The integration is fragile and uneven — some teams have it, others do not — but the bottleneck has shifted from access to consistency. The question changes from "do we use AI?" to "do we use it the same way?"

L3: AI-native. Delegation is universal; the last human-only path has been converted. The workflow has been redesigned around the assumption that execution is cheap: review is built as a system, not a heroic act; alignment happens before the agent fan-out; creation is open to non-engineers through hardened, tested paths. The bottleneck is no longer the workflow — it is the measurement. The organization knows what it is producing but still struggles to know whether it is producing the right things at the right quality.

L4: AI-first. Institutional judgment has been externalized — packaged into specs, evals, review gates, and governance policies that are available to agents as well as humans. The operating model itself is versioned and improvable. The organization treats its own harness the way a software team treats a codebase: something to be maintained, extended, and refactored rather than accumulated and forgotten. This level is rare enough that most field reports describe the path toward it rather than operations at it.

The levels are useful not as a ranking but as a diagnostic. An organization at L2 that believes it is at L3 will keep trying to solve a consistency problem with throughput solutions — more models, more seats, faster generation — and get worse, not better. The bottleneck at each level requires a different kind of work to break: policy and access at L0, measurement and calibration at L1, standardization at L2, redesign at L3, externalization at L4. Misreading your level is how you spend a year on the wrong problem.

What AI-native looks like at different scales

Figure 09.4 / Review is the new bottleneck

One of the persistent confusions in the field is that AI-native looks the same regardless of the size of the organization. It does not. A fifteen-person team building an AI product and a thousand-person enterprise deploying AI internally are both aiming for L3 — but the blockers, the failure modes, and the order of operations are almost entirely different.

For small teams — the seed-stage company or the five-to-fifteen-person product group — the biggest risk is not ungoverned sprawl. It is the false plateau. At this scale, individuals are aligned by proximity: there is no alignment debt because everyone sits in the same room or the same Slack channel and shares context without needing a formal system to carry it. The trap is that this ambient alignment disappears exactly when the team needs to hire, hand off, or scale. The teams that become AI-native at seed scale are the ones that write things down early — specs, decision logs, conventions — not because anyone would otherwise get lost today, but because they are building the harness that the next ten hires will work inside. The bottleneck is not today's efficiency but tomorrow's onboarding.

For scale-stage companies — the fifty-to-two-hundred-person organization that has been through its first growth cycle — the dominant failure mode is silo proliferation. Every team builds its own AI tooling; every tool generates its own context; no context is shared. Organizations at this stage often discover a dozen different internal AI setups, each locally optimized and globally incompatible. The Bloomberg engineering organization, describing its AI deployment across a large and highly specialized technical workforce, found the hard part was not model quality but scale itself: the productivity gains that showed up in greenfield work dropped off sharply against hundreds of millions of lines of existing code, and getting past the early adopters meant treating rollout as an organizational program — onboarding and shared principles acting as the change agent — rather than a tooling decision. The lever at this scale is an explicit internal platform investment: a common layer through which AI capability is provided, measured, and governed. Without it, every L2 win stays local and the organization cannot compound.

For enterprise — the organization with hundreds or thousands of engineers, often operating in a regulated industry, often with technology debt measured in decades — the maturity journey has a prerequisite that the smaller-company conversation routinely omits: legal and compliance as a design input, not a later constraint. An enterprise cannot grant agent access without knowing what the agent can see, what it can do, and who approved both. The organizations that are succeeding with AI at enterprise scale describe getting governance architecture right as the enabling move — not the concluding one. The pilot that lives in a sandbox indefinitely is usually not waiting for a better model. It is waiting for someone to design the credential scope and the audit trail that would allow it to move to production.

The insight that cuts across all three scales is that the phase transitions are always governance events, not capability events. An organization does not move from L2 to L3 because the models got better. It moves because someone made a structural decision: hardened a path, wrote down a convention, built a review system. The capability was available one level below. What unlocks the next level is always an organizational act.

Broader creation, narrower paths to ship

If execution is cheap, the natural move is to let more people execute. This is one of the most radical organizational consequences in the corpus, and Lisa Orr at Zapier states it as a provocation: "Your support team should ship code." Not file tickets for engineers to ship code — ship it themselves. When the cost of a competent first implementation collapses, the historical reason for routing all code through a narrow guild of engineers weakens with it.

But the provocation only works with its other half. Broader creation does not mean a free-for-all; it means widening who can start work while narrowing and hardening the path by which work ships. The support engineer can open the pull request. What protects the company is that the path from that pull request to production runs through the same tests, the same evals, the same review gates, the same permission boundaries that any change runs through. Democratized creation and stronger governance are not opposites. They are the two things that have to rise together, or the first one becomes a liability.

This is why roles are blurring and new ones are appearing at the same time. James Lowe's argument that every product needs an AI product manager — and that it should be you — is really an argument that someone has to own the judgment layer that cheap creation now demands: deciding which of the many possible artifacts is worth shipping, and shaping the constraints under which non-specialists create safely. Denys Linkov's work on structuring a modern AI team points the same direction: the team that ships dependable AI is not a bigger version of the old team. It mixes capabilities that used to live in separate departments, because the unit of work now crosses those departments by default.

The pressure reaches compensation and hiring too. When output is no longer a clean proxy for contribution, the old pay structures wobble — Arman Hezarkhani's proposal to pay engineers more like salespeople, on outcomes rather than effort, is one early attempt to re-anchor reward to value in a world where effort is cheap. And hiring inherits a strange new problem that Beth Glenfield names directly: when everyone interviews with AI, the old signals of competence stop discriminating, and the organization has to learn to hire for judgment and taste rather than for the ability to produce code that an agent can now produce for anyone.

Review becomes the bottleneck

Follow the cheap-execution argument one step further and it lands on a single organizational chokepoint. If one person can direct many agents, and more people can now create, then the total volume of work produced rises far faster than the human capacity to review it. Review — not generation — becomes the binding constraint on how fast the organization can actually move.

This is the structural fact behind the Monday-morning scene. The weekend produced a pile of pull requests, drafts, and automations. Every one of them needs a human judgment somewhere before the organization can trust it. And the supply of trustworthy human judgment did not increase over the weekend. Maggie Appleton, describing collaborative AI engineering from inside GitHub, captures the resulting pathology precisely: going fast without good alignment leads to wasted work, duplicate effort, and giant review queues with little context. The queue is where ungoverned speed goes to die.

The instinct to handle this by asking humans to review harder does not scale, because it asks the scarce resource to absorb the growth of the abundant one. The organizations that cope build the review function the way Chapter 4 argued they should build evals: as a system, not a heroic individual act. Layered validation, where automated checks and evals clear the routine cases so human attention concentrates on the consequential ones. Triage rules that decide what a human must see and what can ship on green. Roll-up visibility of the kind Eric Zakariasson describes in Cursor's software-factory work — a single surface that shows what every agent is doing and, crucially, what the human actually needs to look at — rather than a firehose of individual agent chatter.
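A triage rule of the kind described above can be as small as one function. The sketch below is illustrative only: the `Change` fields, the notion of a "sensitive path", and the 50-line threshold are assumptions chosen for the example, not rules reported by any of the teams cited here.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    BLOCKED = "blocked"              # automated layer failed; no human time spent
    HUMAN_REVIEW = "human_review"    # consequential; a person must look
    SHIP_ON_GREEN = "ship_on_green"  # routine; automated checks are enough


@dataclass(frozen=True)
class Change:
    """Hypothetical summary of one agent-produced change."""
    checks_passed: bool           # tests, linters, type checks
    evals_passed: bool            # behavior evals, where they apply
    touches_sensitive_path: bool  # e.g. auth, billing, migrations (assumed list)
    lines_changed: int


def triage(change: Change, max_auto_lines: int = 50) -> Route:
    """Layered validation: automation clears routine cases first."""
    if not (change.checks_passed and change.evals_passed):
        return Route.BLOCKED
    if change.touches_sensitive_path or change.lines_changed > max_auto_lines:
        return Route.HUMAN_REVIEW
    return Route.SHIP_ON_GREEN
```

The ordering is the design choice: a change that fails automated checks never reaches a reviewer, so human attention is spent only on changes that are both green and consequential.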

The connection back to Chapter 4 is direct and load-bearing: in an AI-native organization, eval and review capacity is not a quality-assurance afterthought. It is the throughput limit of the entire company. You can only safely create as fast as you can trustworthily review.

Alignment debt is the new invisible tax

There is a subtler failure than an overflowing review queue, and it is the one most likely to catch good teams by surprise. When individuals each direct their own fleet of agents in private, every workflow can be locally efficient and the whole can still be globally incoherent. Two engineers solve the same problem two different ways. A feature gets built that quietly conflicts with another team's assumptions. Work is duplicated, contradicted, or wasted — not because anyone was careless, but because alignment never happened.

Maggie Appleton names the root cause with unusual precision: "None of our current tools give teams a shared space to discuss plans, gather the right context, and work with agents as a collective." The tooling optimized the individual loop — one developer, many agents — and left the collective loop unbuilt. Each person's context lives in their own session. The plans never meet until the pull requests collide.

This is worth treating as a distinct organizational liability, and "alignment debt" is a useful name for it. Like technical debt, it accrues invisibly while things feel fast, and it comes due all at once — as the duplicated work, the conflicting implementations, the surprise feature nobody coordinated, the giant unmergeable pile. And like technical debt, the cure is not to slow down but to pay alignment earlier: to move shared planning, context-gathering, and visible work decomposition upstream of the agent fan-out, so that two dozen agents are working from one understood plan rather than two dozen private ones. The operational rule is that the plan, not the pull request, is where two people's work should first meet. If the first time two engineers' agent work touches is at the merge — the point Appleton diagnoses, where the plans never meet until the pull requests collide — alignment was already skipped, and the only question left is how expensive the collision is.
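One way to make plans meet before pull requests do is to have each person declare, ahead of the agent fan-out, which parts of the system their plan intends to touch, and then check those declarations against each other. The sketch below assumes a hypothetical format in which a plan is just an owner name mapped to a set of component paths; real planning tools would carry far more context.

```python
from itertools import combinations


def overlapping_plans(plans: dict[str, set[str]]) -> list[tuple[str, str, set[str]]]:
    """Return every pair of owners whose declared scopes intersect.

    `plans` maps an owner to the components their agents will modify.
    The owner names and path strings are illustrative assumptions.
    """
    conflicts = []
    for first, second in combinations(sorted(plans), 2):
        shared = plans[first] & plans[second]
        if shared:
            conflicts.append((first, second, shared))
    return conflicts


# Example: two plans both intend to change the billing database layer.
declared = {
    "alice": {"billing/api", "billing/db"},
    "bob": {"billing/db", "search"},
    "carol": {"docs"},
}
for first, second, shared in overlapping_plans(declared):
    print(f"{first} and {second} should talk before starting: {sorted(shared)}")
```

A path-level check like this catches only the crudest collisions. It does not detect two plans that solve the same problem in different places, which still needs a shared conversation.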

The deeper point is that as execution fans out, alignment has to move in the opposite direction — it has to concentrate and move earlier. The cheaper it is to start work, the more expensive it becomes to have started the wrong work in parallel twenty times. Alignment debt is the tax an organization pays for treating a collective activity as a collection of private ones.

Governance as the load-bearing wall

The sections above on democratized creation and alignment debt converge on a single structural conclusion: in an AI-native organization, governance is not overhead. It is the mechanism by which the organization can afford to be fast.

This inverts the standard corporate framing. The standard framing treats governance as the brake — the process that slows things down in the name of safety and compliance. Under that framing, the AI-native dream is to minimize governance: to find the path of least institutional resistance from idea to deployed output. That framing produces companies that move fast and then revert, recall, or face consequences they did not budget for.

The organizations that have built durable AI-native practices describe a different relationship. Joel Hron, CTO of Thomson Reuters — a company deploying AI across legal, tax, and risk workflows where, as he puts it, the risks of being wrong are not particularly acceptable — describes the shift that gives this book its title: the north star has moved from assistants that are merely helpful to systems asked to produce output and make judgments and decisions on behalf of users. Making that move in high-stakes work is not primarily a model-quality question; it is a trust question. And the book's argument is that a colleague is trusted with consequential work not because they can never be wrong but because there is a system around them — accountability, escalation, review — that bounds the damage when they are. Agents earn colleague status the same way. Governance is not the obstacle to trust; it is what trust is built from.

The practical consequence is that governance is not a Phase 3 problem to be solved after the agents are running. It is a Day 1 design question. An agent running with unbounded credentials and no audit trail cannot be granted access to the production system — not because the risk team said no, but because there is no basis on which to say yes. Scoped credentials, bounded authority, and comprehensive audit logs do not constrain the agent's capability. They are what allows the agent to be deployed at all. Chapter 7 made this argument for the technical security model; here it applies at the organizational level.
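To make "scoped credentials, bounded authority, and comprehensive audit logs" concrete, here is a minimal sketch. Everything in it is hypothetical: the `AgentCredential` and `AuditLog` types, the action strings, and the in-memory log stand in for whatever identity and logging systems an organization actually runs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgentCredential:
    """Hypothetical scoped credential: what the agent may do and who approved it."""
    agent_id: str
    allowed_actions: frozenset[str]
    approved_by: str


@dataclass
class AuditLog:
    """In-memory stand-in for a durable audit trail."""
    entries: list[dict] = field(default_factory=list)

    def record(self, cred: AgentCredential, action: str, allowed: bool) -> None:
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "agent_id": cred.agent_id,
            "approved_by": cred.approved_by,
            "action": action,
            "allowed": allowed,
        })


def authorize(cred: AgentCredential, action: str, log: AuditLog) -> bool:
    """Allow only explicitly granted actions, and log every decision."""
    allowed = action in cred.allowed_actions
    log.record(cred, action, allowed)  # denials are logged too
    return allowed
```

Two properties carry the argument: authority is an explicit allow-list tied to a named approver, and every request, granted or denied, leaves a record. Together they give the person signing off a basis on which to say yes.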

The organizations succeeding with enterprise AI at scale — across finance, legal, and regulated health contexts — consistently describe the same enabling sequence: define the governance architecture first, then expand the scope of what agents can do within it. The pilot that lives in a sandbox indefinitely is usually not waiting for a better model. It is waiting for someone to design the credential scope, the audit trail, and the escalation path that would allow it to move from demonstration to production. The governance design is the unlocking work.

The resulting principle is a reframe of the question every enterprise team should be asking. Instead of "what is the minimum governance we can get away with?", ask "what is the lightest governance that earns us the trust to go fast?" The answer is always less than the instinctive first draft — layers of approval designed for human-speed work — and always more than the proof-of-concept default of no governance at all. The right amount is whatever makes the next grant of access defensible to the person who has to sign off on it.

The company becomes a harness for its own agents

Pull the threads together and a single shape emerges. The organization that handles cheap creation, the review bottleneck, alignment debt, and governance is doing, at the scale of a company, exactly what Chapter 3 said a good codebase does for a coding agent. It is building a harness.

An AI-native organization has done for itself what a good codebase does for a coding agent: it has taken the judgment that used to live tacitly in senior people's heads and the hallway conversation and made it explicit, versioned, and available to both humans and agents. Shared standards instead of folklore. Written policies instead of "ask Sarah." Permission systems instead of trust by default. Review gates instead of hoping.

This is the synthesis the whole book has been building toward: an organization is the macro-scale version of the same object. A company is a harness for its own agents, human and machine alike, and AI-native advantage is the quality of that harness.

That is also why the advantage is so hard to copy. A competitor can buy the same models and the same seats tomorrow. What it cannot buy overnight is an operating model in which institutional judgment has been packaged into reusable constraints — broad paths to create, narrow paths to ship, review built as a system, alignment paid upfront, and standards legible to the agents doing the work. That is built, not purchased, and the building is the moat.

What this means for what endures

The AI-native organization is not the one with the most enthusiastic prompting culture, and it is not the one with the highest activity dashboards. It is the one that learned to convert cheap generation into trusted throughput — which turns out to require the least glamorous things in the building: clear standards, real review, honest measurement, and alignment that happens before the work fans out rather than after it collides.

Notice what that list does not contain. It does not contain a particular model, a particular vendor, or a particular protocol. Every concrete technology in this book will be replaced; some of it already has been between the talks that anchor these chapters and the page you are reading. What persists is the shape of the problem: cheap execution makes judgment scarce, scarce judgment has to be organized, and organizing it is an engineering discipline applied to an institution rather than a codebase.

The technical question and the organizational question, in the end, turn out to be the same question asked at two scales. How do you build a system in which delegated work compounds instead of fragmenting? For a single agent, the answer was a harness. For a company, the answer is the same word at a larger size. The final chapter asks what survives when the models, the tools, and the org charts have all turned over again — and the answer it reaches for is already visible here, in the parts of the harness that were never really about AI at all.

What to do with this

  • Audit your engineering dashboards for activity-vs-outcome confusion. If you are reporting PRs opened, commits, or lines generated, you are counting the abundant resource; replace or supplement those with outcome measures — rework rate, share of generated work that ships unreverted, and time spent in the review queue — because counting artifacts in a world where artifacts are cheap is counting the wrong thing.
  • Treat review and eval capacity as the company's throughput limit, not a QA afterthought. Build the review function as a system rather than asking humans to review harder: add layered validation so automated checks and evals clear routine cases, triage rules that decide what a human must see versus what can ship on green, and a single roll-up surface (à la Cursor's software-factory work) showing what each agent is doing and what the human actually needs to look at.
  • Move alignment upstream of the agent fan-out. Make the plan, not the pull request, the place two people's work first meets — shared planning, context-gathering, and visible work decomposition before the agents start — so two dozen agents work from one understood plan instead of colliding at merge. Alignment debt accrues invisibly while things feel fast and comes due all at once.
  • Widen who can start work, then harden the single path by which it ships. Take Lisa Orr's provocation seriously — let non-engineers (e.g., support) open pull requests — but route every change through the same tests, evals, review gates, and permission boundaries, because democratized creation and stronger governance have to rise together or the first becomes a liability.
  • Name an owner for the judgment layer. Following James Lowe, designate an AI product manager who decides which of the many cheap artifacts is worth shipping and shapes the constraints under which non-specialists create safely — this judgment does not allocate itself once creation is cheap.
  • Close the last human-only path rather than buying more seats. The holdout process — the review, handoff, or plan that still assumes human-speed work — is the constraint. Partial delegation makes individual people faster inside an unchanged workflow; redesigning the workflow requires finding and converting that holdout. Identify it and convert it, because more seats in an unchanged process does not change the process.
  • Diagnose before you prescribe. Use the five-level model as a diagnostic, not a ranking: an organization at L2 that believes it is at L3 will spend a year optimizing throughput when the real bottleneck is consistency. Find the actual bottleneck — access, measurement, standardization, redesign, or externalization — and work on that specifically.
  • Design governance architecture before expanding scope. Ask not "what is the minimum governance we can get away with?" but "what is the lightest governance that earns us the trust to go fast?" The pilot that lives in a sandbox indefinitely is usually not waiting for a better model — it is waiting for someone to define the credential scope, the audit trail, and the escalation path that would make production access defensible. Do that work first.
  • Write it down before you scale it. For small teams: the ambient alignment that makes your current AI usage feel coherent will not survive the next hiring round. Write the specs, decision logs, and conventions now — not because anyone needs them today, but because the next ten people will.
EVIDENCE OF SOURCE · CHAPTER 09 · VIDEOS

18 claims · 71 source anchors

Evidence — Source Anchors

The important transition is from suggestion to delegated execution

  • from helpfulness to productive
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • I think they need more
    #3 — Jacob Lauritzen, Legora · confidence: high
  • most primitives the magic happens when you combine these things together
    #138 — Sam Bhagwat, Mastra.ai · confidence: high

Harness quality is a major determinant of coding-agent quality

  • a good harness is really operationalized around giving the model text at the right time
    #16 — Ryan Lopopolo, OpenAI · confidence: high
  • there's so much work that has been put in uh over the last you know 20 to 30 years around the automated validation and verification of software that you build
    #57 — Eno Reyes, Factory AI · confidence: high
  • instead of micromanaging, what I'm doing is I'm scaffolding and providing context.
    #190 — Eric Hou, Augment Code · confidence: high
  • identifying problems with the code because if there's no problems then it's probably high quality code
    #179 — Josh Albrecht, Imbue · confidence: high

Agent-ready codebases are designed, not discovered

  • agents MD files an open standard
    #57 — Eno Reyes, Factory AI · confidence: high
  • context deficit as the biggest blocker.
    #190 — Eric Hou, Augment Code · confidence: high
  • a garbage codebase you're going to get
    #621 — Matt Pocock · confidence: high

Evals are a control system, not just a test suite

  • improvement without measurement is limited and imprecise.
    #125 — Ido Pesok, Vercel v0 · confidence: high
  • We still want to build reliable scalable applications and that is still hard
    #184 — Samuel Colvin, Pydantic · confidence: high
  • eval to us it's actually the same problem from a from a systems perspective.
    #628 — Phil Hetzel, Braintrust · confidence: high
  • small CLI tool that we call eval tool
    #689 — Lawrence Jones, incident.io · confidence: high
  • designed to allow agents to leverage our eval suite files.
    #689 — Lawrence Jones, incident.io · confidence: high
  • classic benchmark maxing.
    #746 — Ara Khan, Cline · confidence: high
  • There are right ways to use them. There are wrong ways to use them.
    #746 — Ara Khan, Cline · confidence: high

Human oversight works best as an architectural layer, not an afterthought

  • There needs to be human interaction for approvals or other reasons and of course they need to be able to be uh able to run in parallel for efficiency
    #167 — Preeti Somal, Temporal · confidence: high
  • dial these agency dials far up.
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • maintaining a factory would require you to have an overview of the processes you want your coding agents to go through.
    #629 — Eric Zakariasson, Cursor · confidence: high

The harness is evolving from a local loop into a staged software factory

  • getting to a place where you can build your own like software factory
    #629 — Eric Zakariasson, Cursor · confidence: high
  • unified agent harness that will manage
    #632 — Vaibhav Srivastav & Katia Gil Guzman, OpenAI · confidence: high
  • parallel agents working together to fix
    #42 — Robert Brennan, OpenHands · confidence: high
  • The difference with missions is that we run features serially.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • Our longest mission ran for 16 days
    #653 — Luke Alvoeiro, Factory · confidence: high
  • We just kind of gave each role its own kind of context window.
    #691 — Ash Prabaker & Andrew Wilson, Anthropic · confidence: high
  • it's no longer about the model or the agent. It's about the process.
    #743 — Vincent Koc, OpenClaw · confidence: high

The context gap increasingly includes capability packaging and progressive disclosure

  • doesn't have to be loaded immediately to context.
    #683 — Pedro Rodrigues, Supabase · confidence: high
  • specifically with progressive disclosure.
    #654 — Nick Nisi & Zack Proser, WorkOS · confidence: high
  • 49% reduction of the initial load.
    #625 — Sam Morrow, GitHub · confidence: high
  • rich interactive components that render directly in the chat.
    #747 — Marlene Mhangami & Liam Hampton, GitHub · confidence: high

AI-native advantage depends on organizational coherence, not output volume alone

  • you'll be able to assemble agent teams that can complete tasks orders of magnitude harder than what you can complete with a single agent today.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • observing their workflows, their pain points, co-designing solutions with them
    #693 — Eoin Mulgrew, 10 Downing Street · confidence: high
  • maintaining a factory would require you to have an overview of the processes you want your coding agents to go through.
    #629 — Eric Zakariasson, Cursor · confidence: high

Coordination is the unsolved runtime primitive for multi-agent systems

  • the thing that's missing for me is coordination.
    #704 — Lou Bichard, Ona · confidence: high
  • through sort of state machines, you know, by building out workflows and effectively state machines
    #704 — Lou Bichard, Ona · confidence: high
  • They step on each other's changes. They duplicate work. They make inconsistent architectural decisions.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • we have the two agents basically negotiate what done actually means.
    #691 — Ash Prabaker & Andrew Wilson, Anthropic · confidence: high

Context engineering is a primary engineering discipline, not a prompt trick

  • picking up the right documents and answering those questions is a really cool use case.
    #100 — Ofer Mendelevitch, Vectara · confidence: high
  • cool load generator that Kalen wrote that lets you configure agent swarms uh and agent subtasks with very specific SLOs's
    #104 — Val Bercovici, WEKA · confidence: high
  • connect the dots with graph technology and solve problems like context engineering
    #105 — Stephen Chin, Neo4j · confidence: high
  • the right agent in the future is going to be this system that decides what type of search
    #157 — Will Bryk, Exa.ai · confidence: high

Enterprise usefulness scales with working-set quality, not corpus size

  • about 73% of LM customers implementing use cases say that factual accuracy is their top challenge right now.
    #100 — Ofer Mendelevitch, Vectara · confidence: high
  • how Harvey tackles retrieval, the types of problems there are and then the challenges that come up with that all with like retrieval quality, scaling, uh security,
    #154 — Calvin Qi (Harvey) & Chang She (Lance) · confidence: high
  • how to build enterprise aware agents. How to bring the brilliance of AI into the messy complex realities
    #193 — Chau Tran, Glean · confidence: high
  • you don't need a trillion at once, you need the right million.
    #756 — Kuba Rogut, Turbopuffer · confidence: high

Enterprise MCP adoption converges on gateways, blessed platforms, and a root of trust

  • we think that the goal for a secure this for any security team is to is to bless one platform.
    #624 — Karan Sampath, Anthropic · confidence: high
  • challenges we've faced building and scaling our remote server, how we've overcome them,
    #625 — Sam Morrow, GitHub · confidence: high
  • if we continue this pattern for hundreds or thousands of agents, we've got a pretty big security problem on our hand.
    #150 — Jared Hanson, Keycard · confidence: high

AI-native advantage is an operating-model redesign, not a procurement decision

  • there's a 10x difference between an org where 90% of the engineers are using AI versus an org where 100% of the engineers are using AI.
    #65 — Dan Shipper, Every · confidence: high
  • 80% of respondents say LLMs are working well at work,
    #137 — Barr Yaron, Amplify (2025 AI Engineering Report) · confidence: high
  • It's about evolving from AI features sprinkled into the product to rethinking how you plan, build, and deliver value all through an AI lens.
    #199 — From Hype to Habit (AI-first SaaS) · confidence: high
  • writing code has never been the bottleneck, right? We can in uh we can increase productivity a bit by helping with code completion, but our our biggest bottlenecks are elsewhere within the SDLC.
    #62 — Justin Reock, DX (acq. Atlassian) · confidence: high

Broader creation requires tighter review and governance — they rise together or the first becomes a liability

  • at Zapier we are empowering our support team to ship code.
    #69 — Lisa Orr, Zapier · confidence: high
  • I'm going to make the case for the AI product manager. I'm going to argue that AI expertise is really important for this role.
    #162 — James Lowe, i.AI · confidence: high
  • all these skills that you're prioritizing don't necessarily need to be one person. They can be multiple people.
    #188 — Denys Linkov, Wisedocs · confidence: high
  • I'm going to talk to you today about how I believe AI is breaking how we hire technically.
    #207 — Beth Glenfield, DevDay · confidence: high
  • the challenge becomes who do I say no to?
    #743 — Vincent Koc, OpenClaw · confidence: high

Activity-based metrics misread motion as progress in AI-augmented work

  • these are not productivity metrics. They're useful, but you cannot just kind of use them like maximize them to maximize developer productivity.
    #79 — Yegor Denisov-Blanch, Stanford (120k devs study) · confidence: high
  • I do think that AI increases developer productivity, but there's also cases in which it decreases developer productivity.
    #195 — Yegor Denisov-Blanch, Stanford (100k devs study) · confidence: high
  • just plain old PR throughput. How many pull requests does the average engineer merge per week?
    #101 — Nick Arcolano, Jellyfish (20M PRs) · confidence: high
  • I'm going to talk about how we pay engineers. And we pay engineers like salespeople.
    #63 — Arman Hezarkhani, Tenex · confidence: high

Review capacity is the throughput limit of an AI-native organization

  • this talk uh is called uh one developer, two dozen agents, zero alignment. Uh this is the case for why we need collaborative AI engineering.
    #623 — Maggie Appleton, GitHub · confidence: high
  • you should have multiple different stages where you you plan it, you produce it, you review it and you essentially follow the whole uh SDLC
    #629 — Eric Zakariasson, Cursor · confidence: high
  • every software engineer becomes a code reviewer as basically their primary job.
    #54 — Max Kanat-Alexander, Capital One · confidence: high

Alignment debt is the AI-native equivalent of technical debt

  • None of our current tools give teams a shared space to discuss plans, gather the right context, and work with agents as a collective.
    #623 — Maggie Appleton, GitHub · confidence: high
  • if we believe that all of our products are for like for all time going to be probabilistic, then like we probably have to figure out how this world works.
    #160 — Ben Stein, Teammates · confidence: high
  • you kind of like frontload uh the context to the agents either through like a plan or a long spec and then you send them off
    #629 — Eric Zakariasson, Cursor · confidence: high

Cheap generation raises the value of taste and judgment rather than lowering it

  • software fundamentals matter now more than they actually ever have.
    #1 — Matt Pocock, AI Hero · confidence: high
  • capable of doing everything um immediately
    #6 — Tuomas Artman & Gergely Orosz · confidence: high
  • intentionally designed to put friction
    #14 — Armin Ronacher & Cristina Poncela Cubeiro · confidence: high

Figure 10.0/Transient layer vs durable layer

CHAPTER 10/1,497 words/7 min read/Drafting

Chapter 10 — What Endures

A book about a field moving this fast has an obvious problem: by the time you read it, some of its examples will be obsolete. Models named in these chapters have already been superseded. Frameworks have been renamed. Tools that anchored an argument have shipped a version that changes the details. This final chapter is about the part that does not expire — the operating model that survives the churn, and the reason it survives.

Because the churn is real but shallow. New models arrive, old frameworks get rebranded, and every layer of the stack takes its turn claiming to be the one that finally matters. Underneath that surface, the engineering pattern has been remarkably stable across every chapter of this book. What endures is not a model or a framework. It is a way of turning machine capability into dependable work.

The argument the whole book was making

Figure 10.1/Churn vs durable

Read back across the chapters and a single shape emerges, stated nine different ways.

The shift that started the book was from suggestion to delegation — from a system that tells you things to one that does work you rely on. Everything after that was a consequence. Delegation stretched the failure surface from a single response across a whole workflow, and each chapter took up one stretch of it. Taste became the scarce input once execution got cheap (Chapter 2). Preparing the environment an agent works in became the way to encode that taste (Chapter 3). Evals became the control system that tells you whether any of it is working (Chapter 4). Context became the infrastructure that decides what the model can even see (Chapter 5). Runtimes became what carries the work across time and keeps a human in the loop (Chapter 6). Identity and bounded authority became the price of letting a system act (Chapter 7). Realtime made every one of those failures audible at once (Chapter 8). And the organization turned out to be the same object at the largest scale — a prepared environment for its own agents (Chapter 9).

The throughline connecting all of them is one claim, repeated until it stops sounding like a claim and starts sounding like a description: reliability comes far less from model cleverness than from the scaffolding around the model. That is the book's center of gravity, and it is also the thing most likely to still be true after the specific scaffolding in these chapters has been replaced.

Why the principles outlast the tools

Figure 10.2/Constrained delegation

It is worth being precise about why the pattern endures, rather than just asserting that it does, because the reason is what makes it trustworthy.

Better models do not remove the need for scaffolding. They raise the stakes on it. A more capable model executes a badly framed task more confidently, produces more plausible wrong output, and fails faster and at larger scale when the surrounding system is weak. Every improvement in raw capability makes the prepared environment, the evals, the context discipline, and the authority boundaries matter more, because the blast radius of an unscaffolded mistake grows with the capability of the thing making it. This is the quiet inversion at the heart of the book: the better the models get, the more the engineering around them decides the outcome.

Omar Khattab's framing — "engineering AI systems that endure" — points at the same durability. The systems that last are not the ones bonded to a particular model's quirks. They are the ones built so that the model is a replaceable component inside a structure that holds its shape when the component is swapped. That is ordinary engineering wisdom, and its survival into the AI era is precisely the point. Dax Raad puts the provocation at its sharpest: "AI changes nothing." He is wrong on the surface and right underneath — the interfaces and economics changed enormously, but the discipline of turning capability into dependable systems did not change at all. It got more important.
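One way to picture the model as a replaceable component is a small interface that the surrounding structure depends on. The sketch below is illustrative only: `Model`, `EchoModel`, `ReverseModel`, and `Pipeline` are hypothetical names invented here, and the stand-in models do trivial string work rather than calling any real provider.

```python
from typing import Protocol


class Model(Protocol):
    """Anything that can turn a prompt into text (hypothetical interface)."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in model: returns the prompt in upper case."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


class ReverseModel:
    """A second stand-in with different behavior, to show the swap."""

    def complete(self, prompt: str) -> str:
        return prompt[::-1]


class Pipeline:
    """The structure around the model: framing before, checking after."""

    def __init__(self, model: Model, max_chars: int = 200) -> None:
        self.model = model
        self.max_chars = max_chars

    def run(self, task: str) -> str:
        prompt = f"Task: {task.strip()}"
        output = self.model.complete(prompt)
        # The check belongs to the pipeline, so it survives a model swap.
        if not output or len(output) > self.max_chars:
            raise ValueError("model output failed the pipeline's check")
        return output
```

Swapping `EchoModel` for `ReverseModel` changes the output but not the framing or the check, which is the sense in which the structure holds its shape when the component changes.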

The durable pattern: constrained delegation

Figure 10.3/Cost of weak standards

If the book reduces to one transferable idea, it is this: constrained delegation.

Give the system a clear task. Put it in a prepared working environment. Hand it the right slice of context, not all the context. Give it a way to preserve state across the gaps. Grant it powers narrow enough to trust and revoke. Keep human judgment focused on the consequential edges instead of spread thin across everything or pretending it can be removed entirely. Every chapter was an instance of that sentence. Chapter 3 built the prepared environment. Evals are how you know the constraint is holding. Context architecture is the right slice. Runtimes are the preserved state. Security is the narrow power. The human in the loop is the judgment at the edges.

None of those pieces depends on a particular vendor, protocol, or model generation. They are the stable structure; the tools are the parts you replace inside it. A team that internalizes this can adopt next year's model without rethinking how it works, because the model was always meant to be the swappable part. Mario Zechner's account of building in a world of slop is, at bottom, a constrained-delegation story: the discipline is what lets you use powerful, unreliable generation without drowning in its output.
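The pieces of constrained delegation can be written down as a single record, which makes the "narrow enough to trust and revoke" part concrete. This is a minimal sketch under stated assumptions: the `Delegation` class, its field names, and the tool names in the usage note are all hypothetical, not drawn from any framework discussed in the book.

```python
from dataclasses import dataclass, field


@dataclass
class Delegation:
    """One delegated task with bounded, revocable powers (hypothetical shape)."""

    task: str                 # the clear task
    context: list[str]        # the right slice of context, not all of it
    allowed_tools: set[str]   # the narrow powers
    # Tools that count as consequential edges and need a human decision.
    needs_approval: set[str] = field(default_factory=set)
    # State preserved across the gaps between runs.
    state: dict[str, str] = field(default_factory=dict)
    revoked: bool = False

    def revoke(self) -> None:
        """Withdraw every power at once."""
        self.revoked = True

    def authorize(self, tool: str, approved_by_human: bool = False) -> bool:
        """Allow a tool call only if it is inside the granted bounds."""
        if self.revoked or tool not in self.allowed_tools:
            return False
        if tool in self.needs_approval and not approved_by_human:
            return False
        return True
```

For example, a delegation with `allowed_tools={"read_ticket", "close_ticket"}` and `needs_approval={"close_ticket"}` lets the agent read freely, close a ticket only with human approval, and do nothing at all after `revoke()`.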

What this asks of the reader

The practical consequence is a reallocation of attention. The instinct in a fast-moving field is to chase the tools — to treat staying current with the newest model and framework as the core of the work. That instinct is not wrong, but it is not where the durable advantage lives. The teams that win the next decade will not be the ones that adopted the newest tools fastest. They will be the ones that built the structure to turn cheap generation into trusted throughput — and could therefore absorb every new tool without chaos.

So the question this book leaves you with is not "which model should I use?" It is the question that survives every answer to that one: what does the system around the model have to be, before I can trust what comes out of it? That question doubles as a checklist you can run against any AI feature before shipping it — clear intent, prepared environment, the right context, preserved state, bounded authority, measurement that tells the truth, and human judgment at the edges that matter. A gap in any one of those is where the next failure will come from. That list does not change when the model does. It is the part that endures.
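The checklist is short enough to run mechanically. The sketch below, offered as one possible shape rather than a prescribed tool, lists the seven items and reports which ones a feature has not yet satisfied. The item strings paraphrase the paragraph above; `gaps` and `ready_to_ship` are hypothetical helper names.

```python
CHECKLIST = (
    "clear intent",
    "prepared environment",
    "right context",
    "preserved state",
    "bounded authority",
    "truthful measurement",
    "human judgment at the edges",
)


def gaps(feature: dict[str, bool]) -> list[str]:
    """Return the checklist items the feature does not satisfy, in order.

    An item that is missing from the dict counts as a gap.
    """
    return [item for item in CHECKLIST if not feature.get(item, False)]


def ready_to_ship(feature: dict[str, bool]) -> bool:
    """A feature is ready only when no checklist item is a gap."""
    return not gaps(feature)
```

Treating a missing answer as a gap is a deliberate choice in this sketch: an item nobody has assessed is handled the same way as one that failed.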

The book opened with a shift from assistant to delegate — from a system you consult to one you trust with work. Everything since has been an answer to the same question at rising scale: what has to be true before that trust is earned? The models will keep getting better, and the answer will keep mattering more. The colleague was never going to be trustworthy because it was clever. It is trustworthy, if at all, because of the structure we build around it — and that structure, not the cleverness, is what endures.

What to do with this

  • Invest in the scaffolding, not the model: the prepared environment, evals, context discipline, and authority boundaries are where the durable advantage lives, because a more capable model executes a badly framed task more confidently and fails faster and at larger scale when the surrounding system is weak. The blast radius of an unscaffolded mistake grows with the capability of the thing making it — so the better models get, the more this work pays off.
  • Build so the model is a replaceable component. Treat it as the swappable part inside a structure that holds its shape when you swap it, rather than bonding your system to a particular model's quirks. A team that internalizes this can adopt next year's model without rethinking how it works.
  • Run constrained delegation as a design template for every AI feature: give the system a clear task, a prepared working environment, the right slice of context (not all of it), a way to preserve state across the gaps, and powers narrow enough to trust and revoke — keeping human judgment focused on the consequential edges rather than spread thin or removed entirely.
  • Before shipping any AI feature, run the closing checklist as a pre-ship test: clear intent, prepared environment, the right context, preserved state, bounded authority, measurement that tells the truth, and human judgment at the edges that matter. A gap in any one of those is where the next failure will come from.
  • When deciding where to spend attention in a fast-moving field, reallocate it from chasing tools toward building the structure that turns cheap generation into trusted throughput — the teams that win the next decade are the ones who can absorb every new tool without chaos, not the ones who adopt the newest tool fastest.
EVIDENCE OF SOURCE · CHAPTER 10 · VIDEOS

10 claims · 36 source anchors

Evidence — Source Anchors

The important transition is from suggestion to delegated execution

  • from helpfulness to productive
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • I think they need more
    #3 — Jacob Lauritzen, Legora · confidence: high
  • most primitives the magic happens when you combine these things together
    #138 — Sam Bhagwat, Mastra.ai · confidence: high

Reliability comes less from model cleverness than from surrounding scaffolding

  • The important thing is not the code but the prompt and the guardrails that got you there.
    #16 — Ryan Lopopolo, OpenAI · confidence: high
  • Agents have intelligence and capabilities, but not always expertise that we need for real work.
    #83 — Barry Zhang & Mahesh Murag, Anthropic · confidence: high
  • these are three kind of like ingredients which are pretty simple and pretty basic, but I think provide an interesting kind of like first principles approach for how to think about
    #198 — Harrison Chase, LangChain/LangGraph · confidence: high

Specs are not paperwork; they are executable intent

  • specs are natural language, you're using specs as a control surface to explain what you want the system to do.
    #40 — Al Harris, Amazon Kiro · confidence: high
  • leaving breadcrumbs, documentation, ADRs, persona oriented documentation around what a good job looks like.
    #16 — Ryan Lopopolo, OpenAI · confidence: high

Context failure is often a system-assembly problem, not simply a small-context-window problem

  • the reason context platform engineering is so important is it dramatically simplifies reaching maximum KV cache hit rates
    #104 — Val Bercovici, WEKA · confidence: high
  • connect the dots with graph technology and solve problems like context engineering
    #105 — Stephen Chin, Neo4j · confidence: high
  • irrelevant facts pollute memory.
    #218 — Daniel Chalef, Zep · confidence: high
  • LLMs and tools are orchestrated through predefined code paths.
    #193 — Chau Tran, Glean · confidence: high
  • Agents look at the starting point, end point and try to provide you the results.
    #752 — Nupur Sharma, Qodo · confidence: high
  • the more the tools, the more issues you have.
    #752 — Nupur Sharma, Qodo · confidence: high

Human oversight works best as an architectural layer, not an afterthought

  • There needs to be human interaction for approvals or other reasons and of course they need to be able to be uh able to run in parallel for efficiency
    #167 — Preeti Somal, Temporal · confidence: high
  • dial these agency dials far up.
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • maintaining a factory would require you to have an overview of the processes you want your coding agents to go through.
    #629 — Eric Zakariasson, Cursor · confidence: high

High-stakes systems tune agency instead of maximizing it

  • a binary thing but as a lever that you can dial
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • agentic workflows we can plan and execute
    #201 — Yogendra Miraje, Factset · confidence: high
  • send it to me for approval.
    #202 — Rita Kozlov, Cloudflare · confidence: high
  • credentials, payments, and checkout require determinism.
    #745 — Steve Kaliski, Stripe · confidence: high

AI-native advantage depends on organizational coherence, not output volume alone

  • you'll be able to assemble agent teams that can complete tasks orders of magnitude harder than what you can complete with a single agent today.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • observing their workflows, their pain points, co-designing solutions with them
    #693 — Eoin Mulgrew, 10 Downing Street · confidence: high
  • maintaining a factory would require you to have an overview of the processes you want your coding agents to go through.
    #629 — Eric Zakariasson, Cursor · confidence: high

Realtime AI quality is primarily a coordination and latency-engineering problem, not a model-quality problem

  • the main bottleneck is becoming the tool call,
    #662 — Neil Zeghidour, Gradium AI · confidence: high
  • the entire stack of understanding, producing an answer, and pronouncing it to be around 200 milliseconds.
    #662 — Neil Zeghidour, Gradium AI · confidence: high
  • you have a tool call or open router that is going to have a latency between 500 milliseconds and 4 seconds.
    #662 — Neil Zeghidour, Gradium AI · confidence: high
  • wrapped it up into its own first class primitive,
    #661 — Luke Harries, ElevenLabs · confidence: high
  • the latency is key here
    #663 — Samuel Humeau, Mistral · confidence: high
  • knowing who said what is as important as what was said
    #742 — Hervé Bredin, pyannote · confidence: high

TTS architecture is converging on LLM architecture

  • pretty much uh everybody is using an autoregressive decoder backbone
    #663 — Samuel Humeau, Mistral · confidence: high
  • the king use case for text to speech is uh its usage within agents
    #663 — Samuel Humeau, Mistral · confidence: high
  • the intelligence is baked directly into the model.
    #755 — Thor Schaeff, Google DeepMind · confidence: high

Cheap generation raises the value of taste and judgment rather than lowering it

  • software fundamentals matter now more than they actually ever have.
    #1 — Matt Pocock, AI Hero · confidence: high
  • capable of doing everything um immediately
    #6 — Tuomas Artman & Gergely Orosz · confidence: high
  • intentionally designed to put friction
    #14 — Armin Ronacher & Cristina Poncela Cubeiro · confidence: high