From Copilot to Colleague

Chapter 06 · 10 min read

Runtimes, State, and the Human Control Plane

Durable agents, replay vs snapshot, and why autonomy needs architecture.

A chatbot can get away with amnesia. A production agent cannot.

Short-lived assistance can live inside a conversational loop — a turn comes in, a turn goes out, and nothing has to survive past the reply. Delegated work cannot live there. The moment a system has to persist across retries, timeouts, approvals, multiple tools, and possibly many parallel workers, the central problem stops being next-token cleverness and becomes execution semantics. The system has to preserve state, survive interruption, expose its progress, and resume without losing the thread. This is the chapter where the book's argument turns from what the model knows to how the work itself stays alive between the moments the model is thinking.

Samuel Colvin at Pydantic states the lesson in the voice of someone who learned it in production: "building production AI agents reveals a harsh truth — stateless architectures that work for simple demos become impossibly painful at scale." The word painful is doing honest work. The failure is not dramatic. It is the slow accumulation of edge cases — the timeout that loses an hour of work, the retry that double-charges, the approval that arrives after the agent already acted — that turns a magical demo into a system nobody trusts.

Why stateless demos break

Figure 06.1: Transcript vs workflow

The seductive thing about a chat-loop agent is that it works immediately. You wire a model to some tools, let it loop, and on a short, happy-path task it looks like the future. The trouble is that the architecture has no memory of what has actually happened — only a transcript of what was said. And a transcript is not state.

The distinction matters the instant anything goes wrong. A real delegated task runs long enough to hit a timeout, a rate limit, a flaky tool, a network blip, a required human approval. In a stateless loop, every one of those is a small catastrophe, because there is no durable record of where the work had gotten to. The system cannot answer the only questions that matter under failure: what has been done, what is in flight, and what is safe to retry. It can replay the conversation, but it cannot resume the work, because it never modeled the work as something separate from the conversation about it.

This is why so many impressive agent demos collapse when a team tries to operationalize them. The model is capable, the prompts are decent, the context is strong — and the system still falls over, because it was built like a chat session when the job required a workflow. So run the test before you scale a demo: ask whether the system can answer what has been done, what is in flight, and what is safe to retry without replaying the conversation. If the only record is a transcript, you have a chat session, not a workflow — and the fix is not a smarter model but a different architecture underneath.
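One way to make the failure-state test concrete is to model the work as explicit steps with a status, separate from any transcript. The sketch below is illustrative only: the names `Step`, `WorkflowState`, and the `idempotent` flag are assumptions introduced here, not taken from any framework the speakers describe.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepStatus(Enum):
    PENDING = "pending"
    IN_FLIGHT = "in_flight"
    DONE = "done"


@dataclass
class Step:
    name: str
    idempotent: bool  # True if running it twice causes no extra side effect
    status: StepStatus = StepStatus.PENDING


@dataclass
class WorkflowState:
    """Hypothetical record of the work itself, kept apart from the chat."""
    steps: list[Step] = field(default_factory=list)

    def done(self) -> list[str]:
        return [s.name for s in self.steps if s.status is StepStatus.DONE]

    def in_flight(self) -> list[str]:
        return [s.name for s in self.steps if s.status is StepStatus.IN_FLIGHT]

    def safe_to_retry(self) -> list[str]:
        # An interrupted step is safe to retry only if repeating it is harmless.
        return [s.name for s in self.steps
                if s.status is StepStatus.IN_FLIGHT and s.idempotent]


state = WorkflowState([
    Step("fetch_invoice", idempotent=True, status=StepStatus.DONE),
    Step("charge_card", idempotent=False, status=StepStatus.IN_FLIGHT),
    Step("send_receipt", idempotent=True),
])

print(state.done())           # what has been done
print(state.in_flight())      # what is in flight
print(state.safe_to_retry())  # what is safe to retry
```

Here the interrupted `charge_card` step is in flight but not listed as safe to retry, which is exactly the answer a transcript cannot give.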

Durable execution is the runtime requirement

Figure 06.2: The human control plane

The architecture that survives reality is durable execution: treat the long-running task as structured, checkpointed execution rather than a growing transcript. Preeti Somal at Temporal puts the stakes plainly: agentic systems "must scale and provide durability and reliability — otherwise, no one's going to trust your agent." Trust, in her framing, is not a property of the model's answers. It is a property of the runtime's behavior under failure.

What durability buys is the ability to distinguish what was said from what has actually happened. A durable runtime records each completed step, so that when a tool call times out on the ninth step of a twelve-step task, the system resumes from the ninth step rather than restarting from zero or, worse, redoing a side effect that already fired. Somal describes pushing the reliability semantics down into the runtime so they leave the prompt entirely: "nowhere in there will there be statements like, if something fails keep retrying it — all of those pieces are handled" by the execution layer. That is the architectural move. Retries, timeouts, and resumption stop being things the agent has to reason about turn by turn and become guarantees the runtime provides. The prompt gets to be about the task; the runtime takes care of survival.
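A minimal sketch of checkpointed resumption, assuming a local JSON file as the checkpoint store. The function `run_durably` and the step names are hypothetical, and this is not the API of Temporal or any other durable-execution product. It only illustrates the idea that completed steps are recorded and skipped on resume.

```python
import json
import os
import tempfile


def run_durably(steps, checkpoint_path):
    """Run (name, action) pairs in order, skipping steps already recorded."""
    completed = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            completed = json.load(f)

    for name, action in steps:
        if name in completed:
            continue  # already happened; do not re-fire its side effect
        action()  # may raise; the checkpoint still reflects earlier steps
        completed.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(completed, f)
    return completed


calls = []
attempts = {"n": 0}


def charge():
    calls.append("charge")


def flaky_notify():
    # Fails on the first attempt to simulate a tool timeout.
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TimeoutError("simulated tool timeout")
    calls.append("notify")


steps = [("charge", charge), ("notify", flaky_notify)]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    run_durably(steps, path)
except TimeoutError:
    pass  # the run is interrupted here

result = run_durably(steps, path)  # resume: "charge" is skipped
print(calls)
```

The sketch leaves one gap open on purpose: a crash between `action()` and the checkpoint write would still repeat a side effect, which is why production runtimes pair checkpoints with idempotency guarantees rather than relying on a file like this.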

The same talk names the second thing durability gives you, and it points straight at the next section: "we also store all of the workflow history, so that you can look at the visibility of what is happening as your agent is navigating this complex set of interactions." History is not just for recovery. It is for inspection — and inspection is where humans re-enter the system.

The human control plane is an architectural layer

Figure 06.3: Subagents need recomposition

It is tempting to treat human oversight as a temporary crutch — something you keep around until the models get good enough to remove it. The corpus argues the opposite. In high-value systems, human control is not a phase on the way to full autonomy. It is a permanent architectural layer, designed in rather than bolted on.

Joel Hron at Thomson Reuters gives the clearest frame for what that layer regulates. Autonomy, he argues, is not a switch you flip but a dial you tune: agency is "not a binary thing but a lever that you can dial" up or down depending on how irreversible, risky, and observable the work is. A low-stakes draft can run with the dial turned far up. A filing with professional consequences runs with it turned down, with explicit approval points where a human re-enters before the system acts. The design question is never "autonomous or not." It is "how much agency, at which steps, with what review."
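The dial can be sketched as a per-step policy rather than a global flag. Everything below is an assumption made for illustration: the three `Agency` levels, the `StepProfile` fields, and the numeric risk scale are hypothetical, and the thresholds would need tuning for any real domain.

```python
from dataclasses import dataclass
from enum import IntEnum


class Agency(IntEnum):
    APPROVE_FIRST = 0  # a human authorizes before the step runs
    REVIEW_AFTER = 1   # the step runs, then a human reviews the result
    AUTONOMOUS = 2     # the step runs without review


@dataclass(frozen=True)
class StepProfile:
    irreversible: bool
    risk: int         # 0 (trivial) to 3 (severe); hypothetical scale
    observable: bool  # can a human see afterwards what the step did?


def agency_for(step: StepProfile) -> Agency:
    """Turn the dial down as a step gets more irreversible, risky, or opaque."""
    if step.irreversible or step.risk >= 3:
        return Agency.APPROVE_FIRST
    if step.risk >= 1 or not step.observable:
        return Agency.REVIEW_AFTER
    return Agency.AUTONOMOUS


draft = StepProfile(irreversible=False, risk=0, observable=True)
filing = StepProfile(irreversible=True, risk=3, observable=True)

print(agency_for(draft).name)   # dial turned up
print(agency_for(filing).name)  # dial turned down, approval gate first
```

The design choice worth noting is that the policy takes a step, not a system, as its input, which matches the framing of "how much agency, at which steps, with what review."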

The mechanism that makes those approval points workable is exactly the workflow history durability provides. Hron describes deep-research systems whose long-running behavior becomes "the trajectories that the model would be following along its path of answering this particular type of legal question" — inspectable paths a human can audit rather than opaque jumps from question to answer. The control plane is built from these surfaces: the approval gates where a human authorizes the next step, the roll-up views that show what the system is doing without drowning the reviewer in raw agent chatter, the trajectory and history records that make an action reconstructable after the fact. Eric Zakariasson at Cursor describes the roll-up form of this directly — the human needs "an overview of the processes," a single surface answering what every worker is doing and what actually needs a person's attention, rather than a firehose of individual logs.

This is the same human-judgment-at-the-edges principle the book keeps returning to, now given a concrete home in the runtime. The point of the control plane is not to slow the system down. It is to focus scarce human attention on the consequential moments and let the runtime carry everything else — which gives you a design test for any review surface: if it surfaces raw agent chatter instead of a roll-up of what needs a person, or if it gates a reversible low-stakes step the way it gates an irreversible one, it is spending human attention in the wrong place.
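A roll-up surface of this kind can be sketched as a small aggregation over worker statuses. The `WorkerStatus` record and its state names are hypothetical; the point is only that the reviewer sees counts plus the short list of items that need a person, not the raw logs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class WorkerStatus:
    worker: str
    task: str
    state: str  # "running", "done", "failed", or "awaiting_approval"


def roll_up(statuses):
    """Summarize a fleet: counts per state plus what needs a human."""
    counts = Counter(s.state for s in statuses)
    needs_human = [s for s in statuses
                   if s.state in ("awaiting_approval", "failed")]
    return {
        "counts": dict(counts),
        "needs_attention": [(s.worker, s.task, s.state) for s in needs_human],
    }


fleet = [
    WorkerStatus("w1", "draft summary", "running"),
    WorkerStatus("w2", "submit filing", "awaiting_approval"),
    WorkerStatus("w3", "format citations", "done"),
]

overview = roll_up(fleet)
print(overview["counts"])
print(overview["needs_attention"])
```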

Parallelism raises the stakes on coordination

Figure 06.4: Agency is a dial, not a switch

Everything so far concerns a single durable worker. The architecture gets harder, and more interesting, the moment teams reach for many workers at once — and this is where the corpus is most unsettled.

The appeal is obvious: if one agent is leverage, a fleet is more leverage. OpenAI's Codex team describes the mechanism, spinning off "a master task into decomposable parallel and independent tasks." But the precondition is in that phrase: parallelize only work that is genuinely decomposable and independent, and only after you have a way to recompose, inspect, and route the output back to a human at the right moment. Add workers before you have that, and parallelism just manufactures chaos faster — more diffs, more conflicts, more review than anyone can absorb, which is exactly the alignment-debt failure Chapter 9 will name at organizational scale. Teams go wrong treating "spin up more agents" as the win when the recomposition layer is the actual bottleneck.
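The precondition can be illustrated with a toy fan-out that is only useful because a recomposition step follows it. This is a sketch under stated assumptions: `run_subtask` stands in for an agent working one independent subtask, and the `ok` flag is a hypothetical signal that the result can be merged without review.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subtask(name):
    # Stand-in for an agent working one independent subtask.
    return {"task": name, "diff": f"changes for {name}", "ok": name != "update_docs"}


def recompose(results):
    """Merge what is clean and route the rest to a human."""
    merged = [r["diff"] for r in results if r["ok"]]
    needs_review = [r["task"] for r in results if not r["ok"]]
    return {"merged": merged, "needs_review": needs_review}


subtasks = ["rename_module", "update_docs", "add_tests"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Executor.map returns results in input order, which keeps recomposition simple.
    results = list(pool.map(run_subtask, subtasks))

summary = recompose(results)
print(summary)
```

Without `recompose`, adding workers only produces more unmerged output; with it, the human sees one summary and one short review queue.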

Lou Bichard at Ona sharpens the diagnosis to a single missing piece. The runtime, he argues, is solved — "there are many options for this now, sandboxes and containers" — and so are triggers and orchestration. "The thing that's missing," he says, "is coordination": the agent-native primitive that lets parallel workers pick up tasks, signal completion, and hand off without a human stitching them together. And he is pointed about what is not that primitive: "GitHub is not a coordination layer for agents — it gets incredibly overwhelming." His candidate building blocks are exactly this chapter's subject — "state machines, by building out workflows and effectively state machines" — which lands the parallelism question squarely back on durable execution.
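To make the state-machine idea tangible, here is a minimal in-process sketch of tasks that workers can claim, complete, or release for handoff. The `TaskBoard` class, its events, and the worker names are hypothetical, and a real coordination layer would need a durable shared store with atomic claims, which this sketch does not provide.

```python
from enum import Enum


class TaskState(Enum):
    OPEN = "open"
    CLAIMED = "claimed"
    DONE = "done"


# Allowed transitions; any other (state, event) pair is rejected.
TRANSITIONS = {
    (TaskState.OPEN, "claim"): TaskState.CLAIMED,
    (TaskState.CLAIMED, "complete"): TaskState.DONE,
    (TaskState.CLAIMED, "release"): TaskState.OPEN,  # hand off to another worker
}


class TaskBoard:
    def __init__(self, task_names):
        self.state = {name: TaskState.OPEN for name in task_names}
        self.owner = {}

    def apply(self, task, event, worker):
        current = self.state[task]
        nxt = TRANSITIONS.get((current, event))
        if nxt is None:
            raise ValueError(f"{event!r} not allowed from {current.value}")
        if current is TaskState.CLAIMED and self.owner.get(task) != worker:
            raise PermissionError(f"{task} is owned by {self.owner[task]}")
        self.state[task] = nxt
        if nxt is TaskState.CLAIMED:
            self.owner[task] = worker
        elif nxt is TaskState.OPEN:
            self.owner.pop(task, None)


board = TaskBoard(["migrate_schema"])
board.apply("migrate_schema", "claim", "agent-a")

second_claim_rejected = False
try:
    board.apply("migrate_schema", "claim", "agent-b")
except ValueError:
    second_claim_rejected = True  # one active owner at a time

board.apply("migrate_schema", "complete", "agent-a")
print(board.state["migrate_schema"].value)
```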

What makes the corpus credible here is that the teams shipping production multi-agent systems have not agreed on an answer; they have each substituted a known mechanism for the missing one. Bichard's candidate is state machines built on durable workflows. Factory runs features serially with one active writer ("serial execution with targeted internal parallelization"), eliminating the coordination problem by construction, and reports a longest mission of sixteen days. Anthropic's long-running agents take a third path: a planner-generator-evaluator loop where each role gets "its own kind of context window" and the agents "negotiate what done actually means" through a contract written to files on disk before any code is produced. Their team traces a capability curve from roughly a one-hour autonomous run to twelve hours. Serial execution, file-based contracts, state machines plus durable execution: three substitutes for one primitive that does not yet exist. The honest chapter names the gap and shows the three things teams actually ship, rather than pretending there is a consensus building block.

The runtime is where intelligence becomes dependable

Pull the chapter together and its argument is structural. A machine colleague is not a model with tools attached. It is a model inside an operating environment, and that environment is what determines whether bursts of intelligence become dependable delegated work.

The environment has named parts now. Durable execution, so the work survives interruption and resumes instead of restarting. Explicit state and workflow history, so the system can answer what has actually happened. A human control plane of approvals, roll-up views, and inspectable trajectories, so people supervise at the consequential edges. And, once there are many workers, a coordination story, even if today that story is a chosen substitute rather than a solved primitive. None of these live in the model. All of them live in the runtime, and all of them are the difference between a demo and a system.

The moment a durable, long-running agent can act on its own — with state, tools, and the authority to use them over time — bounding that authority becomes the price of letting it act at all. Identity, permissions, sandboxes, audit: that is the next chapter's subject. Durability gave the agent staying power. Security decides what it is allowed to do with it.

What to do with this

  • Before you scale any agent demo, run the failure-state test: can the system answer what has been done, what is in flight, and what is safe to retry — without replaying the conversation? If the only record is a transcript, you have a chat session, not a workflow. The fix is a different architecture, not a smarter model.
  • Push retry, timeout, and resumption semantics down into the runtime so they leave the prompt entirely — aim for a state where "if something fails keep retrying it" appears nowhere in your prompt because the execution layer handles it. Checkpoint each completed step so a tool timeout on step nine resumes from step nine, not from zero and not by re-firing a side effect that already ran.
  • Store full workflow history, and use it for two distinct jobs: recovery (resuming after failure) and inspection (a human auditing what the agent actually did). Treat the trajectory record as the surface humans review, not an afterthought.
  • Stop asking "autonomous or not" and instead set the agency dial per step — tuned by how irreversible, risky, and observable that step is. Turn it up for a low-stakes draft; turn it down with an explicit approval gate before an action with real consequences.
  • Audit your review surfaces: if one shows raw agent chatter instead of a roll-up of what needs a person, or gates a reversible step as strictly as an irreversible one, it is spending scarce human attention in the wrong place. Build the roll-up overview that answers what every worker is doing and what actually needs review.
  • Don't reach for more parallel workers until you have a recomposition layer — coordination is the missing primitive, and GitHub is not it. Until an agent-native coordinator exists, pick a deliberate substitute: serial execution with one active writer, file-based "negotiate what done means" contracts, or state machines plus durable execution.

21 claims · 88 source anchors

Evidence — Source Anchors

The important transition is from suggestion to delegated execution

  • from helpfulness to productive
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • I think they need more
    #3 — Jacob Lauritzen, Legora · confidence: high
  • most primitives the magic happens when you combine these things together
    #138 — Sam Bhagwat, Mastra.ai · confidence: high

Chat is an insufficient control surface for long-running or high-stakes work

  • Chat is one-dimensional. It's a very low bandwidth interface,
    #3 — Jacob Lauritzen, Legora · confidence: high
  • we're asking AI systems to now produce output and produce judgments and decisions
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • handle state potentially over long periods of time. There needs to be human interaction for approvals
    #167 — Preeti Somal, Temporal · confidence: high

Reliability comes less from model cleverness than from surrounding scaffolding

  • The important thing is not the code but the prompt and the guardrails that got you there.
    #16 — Ryan Lopopolo, OpenAI · confidence: high
  • Agents have intelligence and capabilities, but not always expertise that we need for real work.
    #83 — Barry Zhang & Mahesh Murag, Anthropic · confidence: high
  • these are three kind of like ingredients which are pretty simple and pretty basic, but I think provide an interesting kind of like first principles approach for how to think about
    #198 — Harrison Chase, LangChain/LangGraph · confidence: high

Harness quality is a major determinant of coding-agent quality

  • a good harness is really operationalized around giving the model text at the right time
    #16 — Ryan Lopopolo, OpenAI · confidence: high
  • there's so much work that has been put in uh over the last you know 20 to 30 years around the automated validation and verification of software that you build
    #57 — Eno Reyes, Factory AI · confidence: high
  • instead of micromanaging, what I'm doing is I'm scaffolding and providing context.
    #190 — Eric Hou, Augment Code · confidence: high
  • identifying problems with the code because if there's no problems then it's probably high quality code
    #179 — Josh Albrecht, Imbue · confidence: high

Specs are not paperwork; they are executable intent

  • specs are natural language, you're using specs as a control surface to explain what you want the system to do.
    #40 — Al Harris, Amazon Kiro · confidence: high
  • leaving breadcrumbs, documentation, ADRs, persona oriented documentation around what a good job looks like.
    #16 — Ryan Lopopolo, OpenAI · confidence: high

Evals are a control system, not just a test suite

  • improvement without measurement is limited and imprecise.
    #125 — Ido Pesok, Vercel v0 · confidence: high
  • We still want to build reliable scalable applications and that is still hard
    #184 — Samuel Colvin, Pydantic · confidence: high
  • eval to us it's actually the same problem from a from a systems perspective.
    #628 — Phil Hetzel, Braintrust · confidence: high
  • small CLI tool that we call eval tool
    #689 — Lawrence Jones, incident.io · confidence: high
  • designed to allow agents to leverage our eval suite files.
    #689 — Lawrence Jones, incident.io · confidence: high
  • classic benchmark maxing.
    #746 — Ara Khan, Cline · confidence: high
  • There are right ways to use them. There are wrong ways to use them.
    #746 — Ara Khan, Cline · confidence: high

Realistic evals must be grounded in natural tasks and operational history

  • task should be natural and sourced from the real world and then you should be able to reliably grade them.
    #72 — Naman Jain, Cursor · confidence: high
  • If you build your application in a type safe way, if you use frameworks that allow it to be type safe, you can refactor it with confidence much more quickly.
    #184 — Samuel Colvin, Pydantic · confidence: high
  • Dynamic data sets have real world alignment.
    #153 — Quotient AI + Tavily · confidence: high

Context failure is often a system-assembly problem, not simply a small-context-window problem

  • the reason context platform engineering is so important is it dramatically simplifies reaching maximum KV cache hit rates
    #104 — Val Bercovici, WEKA · confidence: high
  • connect the dots with graph technology and solve problems like context engineering
    #105 — Stephen Chin, Neo4j · confidence: high
  • irrelevant facts pollute memory.
    #218 — Daniel Chalef, Zep · confidence: high
  • LLMs and tools are orchestrated through predefined code paths.
    #193 — Chau Tran, Glean · confidence: high
  • Agents look at the starting point, end point and try to provide you the results.
    #752 — Nupur Sharma, Qodo · confidence: high
  • the more the tools, the more issues you have.
    #752 — Nupur Sharma, Qodo · confidence: high

Durable state and workflow semantics are trust features, not backend details

  • once we get into longer running workflows, that's where it really becomes a problem.
    #99 — Samuel Colvin, Pydantic · confidence: high
  • no one's going to trust your agent.
    #167 — Preeti Somal, Temporal · confidence: high
  • the workflow orchestration layer needs to be deterministic. So it can be rerun um in a in a uh deterministic fashion
    #44 — Peter Wielander, Vercel · confidence: high
  • where I've got some big production CI stack to go and run and deployment takes hours, being able to go and change variables in production or in staging very quickly
    #657 — Samuel Colvin, Pydantic · confidence: high
  • you'll be able to assemble agent teams that can complete tasks orders of magnitude harder than what you can complete with a single agent today.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • minding the gap around observability.
    #680 — Amy Boyd & Nitya Narasimhan, Microsoft · confidence: high

Human oversight works best as an architectural layer, not an afterthought

  • There needs to be human interaction for approvals or other reasons and of course they need to be able to be uh able to run in parallel for efficiency
    #167 — Preeti Somal, Temporal · confidence: high
  • dial these agency dials far up.
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • maintaining a factory would require you to have an overview of the processes you want your coding agents to go through.
    #629 — Eric Zakariasson, Cursor · confidence: high

High-stakes systems tune agency instead of maximizing it

  • a binary thing but as a lever that you can dial
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • agentic workflows we can plan and execute
    #201 — Yogendra Miraje, Factset · confidence: high
  • send it to me for approval.
    #202 — Rita Kozlov, Cloudflare · confidence: high
  • credentials, payments, and checkout require determinism.
    #745 — Steve Kaliski, Stripe · confidence: high

The harness is evolving from a local loop into a staged software factory

  • getting to a place where you can build your own like software factory
    #629 — Eric Zakariasson, Cursor · confidence: high
  • unified agent harness that will manage
    #632 — Vaibhav Srivastav & Katia Gil Guzman, OpenAI · confidence: high
  • parallel agents working together to fix
    #42 — Robert Brennan, OpenHands · confidence: high
  • The difference with missions is that we run features serially.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • Our longest mission ran for 16 days
    #653 — Luke Alvoeiro, Factory · confidence: high
  • We just kind of gave each role its own kind of context window.
    #691 — Ash Prabaker & Andrew Wilson, Anthropic · confidence: high
  • it's no longer about the model or the agent. It's about the process.
    #743 — Vincent Koc, OpenClaw · confidence: high

The context gap increasingly includes capability packaging and progressive disclosure

  • doesn't have to be loaded immediately to context.
    #683 — Pedro Rodrigues, Supabase · confidence: high
  • specifically with progressive disclosure.
    #654 — Nick Nisi & Zack Proser, WorkOS · confidence: high
  • 49% reduction of the initial load.
    #625 — Sam Morrow, GitHub · confidence: high
  • rich interactive components that render directly in the chat.
    #747 — Marlene Mhangami & Liam Hampton, GitHub · confidence: high

AI-native advantage depends on organizational coherence, not output volume alone

  • you'll be able to assemble agent teams that can complete tasks orders of magnitude harder than what you can complete with a single agent today.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • observing their workflows, their pain points, co-designing solutions with them
    #693 — Eoin Mulgrew, 10 Downing Street · confidence: high
  • maintaining a factory would require you to have an overview of the processes you want your coding agents to go through.
    #629 — Eric Zakariasson, Cursor · confidence: high

Harness quality now includes capability packaging, not only repo hygiene

  • That's what a skill is. You're teaching the the LLM how to do something in the way that you expect it to be done
    #654 — Nick Nisi & Zack Proser, WorkOS · confidence: high
  • This is how the agent is
    #683 — Pedro Rodrigues, Supabase · confidence: high
  • 49% reduction of the initial
    #625 — Sam Morrow, GitHub · confidence: high
  • the schema is the UI for the agent.
    #744 — Michael Hablich, Google (Chrome DevTools) · confidence: high

Context failure is often a capability-exposure problem, not only a retrieval problem

  • MCP versus skill debate
    #683 — Pedro Rodrigues, Supabase · confidence: high
  • you can do it in a better way. And that is specifically with progressive disclosure.
    #654 — Nick Nisi & Zack Proser, WorkOS · confidence: high
  • grouping concept of related product
    #625 — Sam Morrow, GitHub · confidence: high

Evals are strongest when they are trace-linked and fed by production observability

  • what is the gap between agent observability and what you're actually building. How do we mind that gap?
    #680 — Amy Boyd & Nitya Narasimhan, Microsoft · confidence: high
  • we go from like a testing and eval paradigm to a monitoring p uh paradigm.
    #655 — Danny Gollapalli & Ben Hylak, Raindrop · confidence: high
  • where I've got some big production CI stack to go and run and deployment takes hours, being able to go and change variables in production or in staging very quickly
    #657 — Samuel Colvin, Pydantic · confidence: high
  • download all of the UI that we have as a file system?
    #689 — Lawrence Jones, incident.io · confidence: high
  • 25 agents in parallel
    #689 — Lawrence Jones, incident.io · confidence: high
  • it's actually the telemetry that does that.
    #750 — Dat Ngo, Arize · confidence: high

Coordination is the unsolved runtime primitive for multi-agent systems

  • the thing that's missing for me is coordination.
    #704 — Lou Bichard, Ona · confidence: high
  • through sort of state machines, you know, by building out workflows and effectively state machines
    #704 — Lou Bichard, Ona · confidence: high
  • They step on each other's changes. They duplicate work. They make inconsistent architectural decisions.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • we have the two agents basically negotiate what done actually means.
    #691 — Ash Prabaker & Andrew Wilson, Anthropic · confidence: high

Context engineering is a primary engineering discipline, not a prompt trick

  • picking up the right documents and answering those questions is a really cool use case.
    #100 — Ofer Mendelevitch, Vectara · confidence: high
  • cool load generator that Kalen wrote that lets you configure agent swarms uh and agent subtasks with very specific SLOs's
    #104 — Val Bercovici, WEKA · confidence: high
  • connect the dots with graph technology and solve problems like context engineering
    #105 — Stephen Chin, Neo4j · confidence: high
  • the right agent in the future is going to be this system that decides what type of search
    #157 — Will Bryk, Exa.ai · confidence: high

RAG, memory, and GraphRAG solve different jobs; collapsing them into one bucket misses the architecture

  • rag or retrieval augmented generation where you have so many things that you can't fit them all in
    #48 — Jack Morris · confidence: high
  • why you need to model your memory after your business domain.
    #218 — Daniel Chalef, Zep · confidence: high
  • the basic construct of a knowledge graph is um nodes which represent different people in the situation, relationships, and then you can attach properties to these nodes.
    #105 — Stephen Chin, Neo4j · confidence: high
  • we want to look at patterns for successful graph applications uh for um making LLMs a little bit smarter by putting knowledge graph into the picture.
    #215 — Michael, Jesus & Stephen, Neo4j · confidence: high
  • how can we create a graph rack system what are the advantages of it and if we add the hybrid nature to it how it is helpful
    #219 — Mitesh Patel, NVIDIA · confidence: high
  • you need to be like tuned to what what every technique gives you before you go and invest in it.
    #156 — David Karam, Pi Labs · confidence: high
  • retrieval is not just vector search.
    #756 — Kuba Rogut, Turbopuffer · confidence: high

Once an AI system can act autonomously, bounding its authority becomes the price of deployment

  • we're asking AI systems to now produce output and produce judgments and decisions
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • most primitives the magic happens when you combine these things together
    #138 — Sam Bhagwat, Mastra.ai · confidence: high