Chapter 6 — Runtimes, State, and the Human Control Plane
A chatbot can get away with amnesia. A production agent cannot.
Short-lived assistance can live inside a conversational loop — a turn comes in, a turn goes out, and nothing has to survive past the reply. Delegated work cannot live there. The moment a system has to persist across retries, timeouts, approvals, multiple tools, and possibly many parallel workers, the central problem stops being next-token cleverness and becomes execution semantics. The system has to preserve state, survive interruption, expose its progress, and resume without losing the thread. This is the chapter where the book's argument turns from what the model knows to how the work itself stays alive between the moments the model is thinking.
Samuel Colvin at Pydantic states the lesson in the voice of someone who learned it in production: "building production AI agents reveals a harsh truth — stateless architectures that work for simple demos become impossibly painful at scale." The word painful is doing honest work. The failure is not dramatic. It is the slow accumulation of edge cases — the timeout that loses an hour of work, the retry that double-charges, the approval that arrives after the agent already acted — that turns a magical demo into a system nobody trusts.
Why stateless demos break
The seductive thing about a chat-loop agent is that it works immediately. You wire a model to some tools, let it loop, and on a short, happy-path task it looks like the future. The trouble is that the architecture has no memory of what has actually happened — only a transcript of what was said. And a transcript is not state.
The distinction matters the instant anything goes wrong. A real delegated task runs long enough to hit a timeout, a rate limit, a flaky tool, a network blip, a required human approval. In a stateless loop, every one of those is a small catastrophe, because there is no durable record of where the work had gotten to. The system cannot answer the only questions that matter under failure: what has been done, what is in flight, and what is safe to retry. It can replay the conversation, but it cannot resume the work, because it never modeled the work as something separate from the conversation about it.
This is why so many impressive agent demos collapse when a team tries to operationalize them. The model is capable, the prompts are decent, the context is strong — and the system still falls over, because it was built like a chat session when the job required a workflow. So run the test before you scale a demo: ask whether the system can answer what has been done, what is in flight, and what is safe to retry without replaying the conversation. If the only record is a transcript, you have a chat session, not a workflow — and the fix is not a smarter model but a different architecture underneath.
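The failure-state test can be made concrete with a small sketch. It assumes a hypothetical `WorkRecord` that tracks each step's status separately from any transcript; the names `Step`, `StepStatus`, and the `idempotent` flag are illustrative and not taken from any particular framework.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class StepStatus(Enum):
    PENDING = "pending"
    IN_FLIGHT = "in_flight"
    DONE = "done"


@dataclass
class Step:
    name: str
    idempotent: bool  # assumption: True means a repeat causes no extra side effect
    status: StepStatus = StepStatus.PENDING


@dataclass
class WorkRecord:
    """Hypothetical record of the work itself, kept apart from the chat transcript."""
    steps: List[Step] = field(default_factory=list)

    def _names(self, status: StepStatus) -> List[str]:
        return [s.name for s in self.steps if s.status is status]

    def done(self) -> List[str]:
        return self._names(StepStatus.DONE)

    def in_flight(self) -> List[str]:
        return self._names(StepStatus.IN_FLIGHT)

    def safe_to_retry(self) -> List[str]:
        # An in-flight step has an unknown outcome, so only idempotent ones
        # can be retried blindly; the rest need a check or a human.
        return [s.name for s in self.steps
                if s.status is StepStatus.IN_FLIGHT and s.idempotent]


# Illustrative task state at the moment of a failure.
record = WorkRecord([
    Step("fetch_invoice", idempotent=True, status=StepStatus.DONE),
    Step("charge_card", idempotent=False, status=StepStatus.IN_FLIGHT),
    Step("send_receipt", idempotent=True, status=StepStatus.IN_FLIGHT),
    Step("close_ticket", idempotent=True),
])
```

Here `record.safe_to_retry()` returns only `send_receipt`: the in-flight `charge_card` step is exactly the retry that double-charges, and a transcript alone cannot tell the two apart.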
Durable execution is the runtime requirement
The architecture that survives reality is durable execution: treat the long-running task as structured, checkpointed execution rather than a growing transcript. Preeti Somal at Temporal puts the stakes plainly: agentic systems "must scale and provide durability and reliability — otherwise, no one's going to trust your agent." Trust, in her framing, is not a property of the model's answers. It is a property of the runtime's behavior under failure.
What durability buys is the ability to distinguish what was said from what has actually happened. A durable runtime records each completed step, so that when a tool call times out on the ninth step of a twelve-step task, the system resumes from the ninth step rather than restarting from zero or, worse, redoing a side effect that already fired. Somal describes pushing the reliability semantics down into the runtime so they leave the prompt entirely: "nowhere in there will there be statements like, if something fails keep retrying it — all of those pieces are handled" by the execution layer. That is the architectural move. Retries, timeouts, and resumption stop being things the agent has to reason about turn by turn and become guarantees the runtime provides. The prompt gets to be about the task; the runtime takes care of survival.
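A minimal sketch of that resume-from-step-nine behavior follows. It is not Temporal's API: `run_with_checkpoints`, the JSON checkpoint file, and the `demo` harness are all hypothetical, and the sketch assumes a `TimeoutError` means the step did not take effect. A real runtime would also need idempotency keys for steps whose side effect may have fired before the timeout.

```python
import json
import os
import tempfile
from typing import Callable, List, Tuple


def run_with_checkpoints(
    steps: List[Tuple[str, Callable[[], None]]],
    checkpoint_path: str,
    max_attempts: int = 3,
) -> List[str]:
    """Run steps in order, skipping any that a previous run already completed."""
    completed: List[str] = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            completed = json.load(f)

    for name, fn in steps:
        if name in completed:
            continue  # already recorded as done: do not re-fire its side effect
        for attempt in range(1, max_attempts + 1):
            try:
                fn()
                break
            except TimeoutError:
                if attempt == max_attempts:
                    raise  # give up; the checkpoint still holds the earlier steps
        completed.append(name)
        # Write to a temp file and rename, so a crash cannot leave a half-written file.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(completed, f)
        os.replace(tmp_path, checkpoint_path)
    return completed


def demo() -> List[str]:
    """Twelve steps; step 9 times out during the first run, then succeeds."""
    effects: List[str] = []  # stands in for real side effects
    outage = {"active": True}

    def make_step(name: str) -> Callable[[], None]:
        def step() -> None:
            if name == "step_9" and outage["active"]:
                raise TimeoutError("tool call timed out")
            effects.append(name)
        return step

    steps = [(f"step_{i}", make_step(f"step_{i}")) for i in range(1, 13)]
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "checkpoint.json")
        try:
            run_with_checkpoints(steps, path, max_attempts=2)
        except TimeoutError:
            pass  # the first run dies at step 9 after exhausting its retries
        outage["active"] = False
        run_with_checkpoints(steps, path)  # the second run resumes at step 9
    return effects
```

Calling `demo()` returns each of the twelve step names exactly once: steps one through eight are never repeated, and the retry policy lives in the runner rather than in any prompt.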
The same talk names the second thing durability gives you, and it points straight at the next section: "we also store all of the workflow history, so that you can look at the visibility of what is happening as your agent is navigating this complex set of interactions." History is not just for recovery. It is for inspection — and inspection is where humans re-enter the system.
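The two jobs of history, recovery and inspection, can be sketched as two views over one append-only log. The `WorkflowHistory` class and its event names are assumptions made for illustration, and the log is held in memory here, whereas a durable runtime would persist it.

```python
import time
from typing import Any, Dict, List


class WorkflowHistory:
    """Hypothetical append-only event log for one workflow run."""

    def __init__(self) -> None:
        self._events: List[Dict[str, Any]] = []

    def record(self, step: str, event: str, detail: str = "") -> None:
        self._events.append({
            "seq": len(self._events) + 1,
            "at": time.time(),
            "step": step,
            "event": event,
            "detail": detail,
        })

    def completed_steps(self) -> List[str]:
        # Recovery view: which steps can be skipped on resume.
        return [e["step"] for e in self._events if e["event"] == "completed"]

    def timeline(self) -> List[str]:
        # Inspection view: a readable trail for a human reviewer.
        return [f'{e["seq"]}. {e["step"]}: {e["event"]} {e["detail"]}'.rstrip()
                for e in self._events]


history = WorkflowHistory()
history.record("search", "started")
history.record("search", "completed", "12 results")
history.record("draft", "started")
```

The same events answer both questions: `completed_steps()` tells the runtime where to resume, and `timeline()` gives a reviewer the path the agent took.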
The human control plane is an architectural layer
It is tempting to treat human oversight as a temporary crutch — something you keep around until the models get good enough to remove it. The corpus argues the opposite. In high-value systems, human control is not a phase on the way to full autonomy. It is a permanent architectural layer, designed in rather than bolted on.
Joel Hron at Thomson Reuters gives the clearest frame for what that layer regulates. Autonomy, he argues, is not a switch you flip but a dial you tune: agency is "not a binary thing but a lever that you can dial" up or down depending on how irreversible, risky, and observable the work is. A low-stakes draft can run with the dial turned far up. A filing with professional consequences runs with it turned down, with explicit approval points where a human re-enters before the system acts. The design question is never "autonomous or not." It is "how much agency, at which steps, with what review."
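One way to picture the dial is as a per-step policy function. The sketch below is an assumption-laden illustration, not Hron's system: the `StepProfile` fields mirror the three factors named above, while the 0 to 3 risk scale and the thresholds are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Review(Enum):
    AUTO = "run without review"
    NOTIFY = "run, then notify a human"
    APPROVE = "wait for human approval before acting"


@dataclass(frozen=True)
class StepProfile:
    irreversible: bool  # can the action be undone after it fires?
    risk: int           # hypothetical scale: 0 (trivial) to 3 (professional consequences)
    observable: bool    # will a human see the result without extra effort?


def review_level(profile: StepProfile) -> Review:
    """Map one step's profile to a review requirement (thresholds are illustrative)."""
    if profile.irreversible or profile.risk >= 3:
        return Review.APPROVE
    if profile.risk >= 1 or not profile.observable:
        return Review.NOTIFY
    return Review.AUTO


draft = StepProfile(irreversible=False, risk=0, observable=True)
filing = StepProfile(irreversible=True, risk=3, observable=True)
```

With these made-up thresholds, `review_level(draft)` is `Review.AUTO` and `review_level(filing)` is `Review.APPROVE`; the point is that the decision is made per step, not once for the whole system.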
The mechanism that makes those approval points workable is exactly the workflow history durability provides. Hron describes deep-research systems whose long-running behavior becomes "the trajectories that the model would be following along its path of answering this particular type of legal question" — inspectable paths a human can audit rather than opaque jumps from question to answer. The control plane is built from these surfaces: the approval gates where a human authorizes the next step, the roll-up views that show what the system is doing without drowning the reviewer in raw agent chatter, the trajectory and history records that make an action reconstructable after the fact. Eric Zakariasson at Cursor describes the roll-up form of this directly — the human needs "an overview of the processes," a single surface answering what every worker is doing and what actually needs a person's attention, rather than a firehose of individual logs.
This is the same human-judgment-at-the-edges principle the book keeps returning to, now given a concrete home in the runtime. The point of the control plane is not to slow the system down. It is to focus scarce human attention on the consequential moments and let the runtime carry everything else — which gives you a design test for any review surface: if it surfaces raw agent chatter instead of a roll-up of what needs a person, or if it gates a reversible low-stakes step the way it gates an irreversible one, it is spending human attention in the wrong place.
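A roll-up view of this kind can be sketched as a single function over worker states. The `WorkerState` shape, the status strings, and the `NEEDS_PERSON` set are hypothetical choices for illustration only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

# Assumption: these are the only statuses that warrant a person's attention.
NEEDS_PERSON = {"awaiting_approval", "failed"}


@dataclass
class WorkerState:
    worker: str
    task: str
    status: str  # e.g. "running", "done", "failed", "awaiting_approval"


def roll_up(states: List[WorkerState]) -> Dict[str, object]:
    """Summarize all workers, and list only the items a human must look at."""
    counts = Counter(s.status for s in states)
    attention = [f"{s.worker}: {s.task} ({s.status})"
                 for s in states if s.status in NEEDS_PERSON]
    return {"counts": dict(counts), "needs_person": attention}


overview = roll_up([
    WorkerState("w1", "refactor parser", "running"),
    WorkerState("w2", "update billing schema", "awaiting_approval"),
    WorkerState("w3", "fix flaky test", "done"),
    WorkerState("w4", "migrate config", "failed"),
])
```

The reviewer sees four workers summarized by status and exactly two items that need a decision, with the raw logs left in the history for anyone who wants to drill down.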
Parallelism raises the stakes on coordination
Everything so far concerns a single durable worker. The architecture gets harder, and more interesting, the moment teams reach for many workers at once — and this is where the corpus is most unsettled.
The appeal is obvious: if one agent is leverage, a fleet is more leverage. OpenAI's Codex team describes the mechanism, spinning off "a master task into decomposable parallel and independent tasks." But the precondition is in that phrase: parallelize only work that is genuinely decomposable and independent, and only after you have a way to recompose, inspect, and route the output back to a human at the right moment. Add workers before you have that, and parallelism just manufactures chaos faster — more diffs, more conflicts, more review than anyone can absorb, which is exactly the alignment-debt failure Chapter 9 will name at organizational scale. Teams go wrong treating "spin up more agents" as the win when the recomposition layer is the actual bottleneck.
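The precondition can be sketched as a fan-out that always ends in a recomposition step. This is a generic illustration using Python's standard thread pool, not a description of how Codex works; `fan_out_and_recompose` and the `summarize` worker are hypothetical names, and the sketch assumes the subtasks share no state.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fan_out_and_recompose(
    subtasks: List[str],
    worker: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, Dict[str, str]]:
    """Run independent subtasks in parallel, then gather one reviewable result."""
    results: Dict[str, str] = {}
    needs_review: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(worker, name) for name in subtasks}
        # Collect in submission order so the recomposed output is deterministic.
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                # A failed subtask is routed to a human instead of being dropped.
                needs_review[name] = repr(exc)
    return {"results": results, "needs_review": needs_review}


def summarize(name: str) -> str:
    """Hypothetical worker: one subtask fails to show the review path."""
    if name == "billing":
        raise ValueError("missing input")
    return f"{name}: ok"


outcome = fan_out_and_recompose(["auth", "billing", "search"], summarize)
```

The fan-out is a single line; most of the function is recomposition and routing, which matches where the text locates the real bottleneck.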
Lou Bichard at Ona sharpens the diagnosis to a single missing piece. The runtime, he argues, is solved — "there are many options for this now, sandboxes and containers" — and so are triggers and orchestration. "The thing that's missing," he says, "is coordination": the agent-native primitive that lets parallel workers pick up tasks, signal completion, and hand off without a human stitching them together. And he is pointed about what is not that primitive: "GitHub is not a coordination layer for agents — it gets incredibly overwhelming." His candidate building blocks are exactly this chapter's subject — "state machines, by building out workflows and effectively state machines" — which lands the parallelism question squarely back on durable execution.
What makes the corpus credible here is that the teams shipping production multi-agent systems have not agreed on an answer; they have each substituted a known mechanism for the missing one. Factory runs features serially with one active writer — "serial execution with targeted internal parallelization" — eliminating the coordination problem by construction, and reports a longest mission of sixteen days. Anthropic's long-running agents take a third path: a planner-generator-evaluator loop where each role gets "its own kind of context window" and the agents "negotiate what done actually means" through a contract written to files on disk before any code is produced — a capability curve their team traces from roughly a one-hour autonomous run to twelve hours on the same simple . Serial execution, file-based contracts, state machines plus durable execution: three substitutes for one primitive that does not yet exist. The honest chapter names the gap and shows the three things teams actually ship, rather than pretending there is a consensus building block.
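As a rough sketch of the state-machine substitute, the pick-up, signal-completion, and hand-off behavior can be modeled as explicit transitions on a shared task board. `TaskBoard` and its transition table are assumptions made for illustration: the board lives in memory in one process, whereas a real coordinator would need durable storage and an atomic claim operation.

```python
from enum import Enum
from typing import Dict, Iterable, Tuple


class TaskState(Enum):
    OPEN = "open"
    CLAIMED = "claimed"
    DONE = "done"


# Allowed transitions: (current state, event) -> next state.
TRANSITIONS: Dict[Tuple[TaskState, str], TaskState] = {
    (TaskState.OPEN, "claim"): TaskState.CLAIMED,
    (TaskState.CLAIMED, "complete"): TaskState.DONE,
    (TaskState.CLAIMED, "hand_off"): TaskState.OPEN,
}


class TaskBoard:
    """Hypothetical single-process board with one active owner per task."""

    def __init__(self, tasks: Iterable[str]) -> None:
        self.state: Dict[str, TaskState] = {t: TaskState.OPEN for t in tasks}
        self.owner: Dict[str, str] = {}

    def _apply(self, task: str, event: str) -> None:
        key = (self.state[task], event)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {event} a task that is {self.state[task].value}")
        self.state[task] = TRANSITIONS[key]

    def _check_owner(self, task: str, worker: str) -> None:
        if self.owner.get(task) != worker:
            raise PermissionError(f"{worker} does not hold {task}")

    def claim(self, task: str, worker: str) -> None:
        self._apply(task, "claim")  # fails if another worker already holds it
        self.owner[task] = worker

    def complete(self, task: str, worker: str) -> None:
        self._check_owner(task, worker)
        self._apply(task, "complete")

    def hand_off(self, task: str, worker: str) -> None:
        self._check_owner(task, worker)
        self._apply(task, "hand_off")
        del self.owner[task]  # the task is open again for any worker
```

A second worker that tries to claim a held task gets a `ValueError`, and a worker that tries to complete a task it does not hold gets a `PermissionError`, so the single-writer rule is enforced by the board rather than by convention.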
The runtime is where intelligence becomes dependable
Pull the chapter together and its argument is structural. A machine colleague is not a model with tools attached. It is a model inside an operating environment — and that environment is what determines whether bursts of intelligence become dependable delegated work.
The environment has named parts now. Durable execution, so the work survives interruption and resumes instead of restarting. Explicit state and workflow history, so the system can answer what has actually happened. A human control plane of approvals, roll-up views, and inspectable trajectories, so people supervise at the consequential edges. And, once there are many workers, a coordination story — even if today that story is a chosen substitute rather than a solved primitive. None of these live in the model. All of them live in the runtime, and all of them are the difference between a demo and a system.
The moment a durable, long-running agent can act on its own — with state, tools, and the authority to use them over time — bounding that authority becomes the price of letting it act at all. Identity, permissions, sandboxes, audit: that is the next chapter's subject. Durability gave the agent staying power. Security decides what it is allowed to do with it.
What to do with this
- Before you scale any agent demo, run the failure-state test: can the system answer what has been done, what is in flight, and what is safe to retry — without replaying the conversation? If the only record is a transcript, you have a chat session, not a workflow. The fix is a different architecture, not a smarter model.
- Push retry, timeout, and resumption semantics down into the runtime so they leave the prompt entirely — aim for a state where "if something fails keep retrying it" appears nowhere in your prompt because the execution layer handles it. Checkpoint each completed step so a tool timeout on step nine resumes from step nine, not from zero and not by re-firing a side effect that already ran.
- Store full workflow history, and use it for two distinct jobs: recovery (resuming after failure) and inspection (a human auditing what the agent actually did). Treat the trajectory record as the surface humans review, not an afterthought.
- Stop asking "autonomous or not" and instead set the agency dial per step — tuned by how irreversible, risky, and observable that step is. Turn it up for a low-stakes draft; turn it down with an explicit approval gate before an action with real consequences.
- Audit your review surfaces: if one shows raw agent chatter instead of a roll-up of what needs a person, or gates a reversible step as strictly as an irreversible one, it is spending scarce human attention in the wrong place. Build the roll-up overview that answers what every worker is doing and what actually needs review.
- Don't reach for more parallel workers until you have a recomposition layer — coordination is the missing primitive, and GitHub is not it. Until an agent-native coordinator exists, pick a deliberate substitute: serial execution with one active writer, file-based "negotiate what done means" contracts, or state machines plus durable execution.