Runtimes and the Control Plane — From Copilot to Colleague

Chapter 06 / 4,005 words / Drafting

Source-anchored claims

1. The important transition is from suggestion to delegated execution
2. Chat is an insufficient control surface for long-running or high-stakes work
3. Reliability comes less from model cleverness than from surrounding scaffolding
4. Harness quality is a major determinant of coding-agent quality
5. Specs are not paperwork; they are executable intent
6. Evals are a control system, not just a test suite
7. Realistic evals must be grounded in natural tasks and operational history
8. Context failure is often a system-assembly problem, not simply a small-context-window problem
9. Durable state and workflow semantics are trust features, not backend details
10. Human oversight works best as an architectural layer, not an afterthought
11. High-stakes systems tune agency instead of maximizing it
12. The harness is evolving from a local loop into a staged software factory
13. The context gap increasingly includes capability packaging and progressive disclosure
14. AI-native advantage depends on organizational coherence, not output volume alone
15. Harness quality now includes capability packaging, not only repo hygiene
16. Context failure is often a capability-exposure problem, not only a retrieval problem
17. Evals are strongest when they are trace-linked and fed by production observability
18. Coordination is the unsolved runtime primitive for multi-agent systems
19. Context engineering is a primary engineering discipline, not a prompt trick
20. RAG, memory, and GraphRAG solve different jobs; collapsing them into one bucket misses the architecture
21. Once an AI system can act autonomously, bounding its authority becomes the price of deployment
22. The gap that kills agent PoCs is the evaluation gap (no defined, continuously measured definition of success), not the choice of model
23. Route each task to the cheapest model that can do it; tiered model selection by difficulty is accepted practice, not a frontier idea
24. Trustworthy judgment can be manufactured from cheap stochastic generation: sample-and-vote, multi-model consensus, and debate panels beat a single expensive call
25. Parallel agents need per-agent runtime isolation (a sandbox, micro-VM, or worktree each) because containers are not a sufficient boundary for agent-generated code
26. Input tokens dominate agent cost; fix what you feed the model before you optimize which model

A chatbot can get away with amnesia. A production agent cannot.

That difference is not philosophical. It is architectural. A chat system can answer a question, emit a patch, suggest a draft, and disappear. But the moment you ask a system to do work that unfolds across time, tools, failures, and approvals, the center of gravity moves. The question is no longer only whether the model can produce a smart next token. The question is whether the surrounding system can preserve intent, survive interruption, recover from error, expose its progress, and stop in the right places for human review.

This is where many impressive agent demos break. The model itself may be good enough. The harness may be decent. The evals may even exist. The context may be strong. But the system was still built like a conversation when it needed to be built like a workflow. It loses track of what already happened. A retry repeats work or performs the same action twice. A human cannot tell which subagent did what. An approval arrives too late, after the expensive or risky step already happened. The agent does not fail because it is unintelligent. It fails because it has nowhere durable to stand.

The next layer is runtime design. Once agents act over time, architecture becomes destiny.

Stateless systems hit a wall

Figure 06.1: Transcript vs workflow

The easiest way to understand the runtime problem is to notice how much modern agent discourse still inherits its assumptions from chat. Chat is an excellent interface for short-lived assistance. It is forgiving. It is intuitive. It lets a user redirect the system turn by turn. For many workflows, that is enough. But chat history is a weak execution substrate for delegated work. A transcript is not the same thing as state. It does not cleanly represent task progress, pending approvals, completed tool calls, rollback boundaries, or which intermediate outputs are binding versus disposable.

Samuel Colvin states the break point simply: “Once we get into longer running workflows, that’s where it really becomes a problem.” The line matters because it does not say stateless systems are always bad. It says they hit a wall when work acquires duration. A short answer can be regenerated. A half-finished research trajectory, a partially executed software task, or a multi-step legal workflow cannot be managed so casually.

This is the same shift the book has been tracing from the beginning. In a toy setting, you can still tell yourself the model is the product. In a real system, the surrounding structure becomes inseparable from the capability. Preeti Somal gives the trust version of the same point: agent systems “must scale and provide durability and reliability. Otherwise, no one’s going to trust your agent.” That is the operating condition for delegation, not a platform engineer’s hobbyhorse.

Trust fails quickly when continuity fails. A coding agent that loses its place after every interruption is not a colleague. It is an intern with total amnesia. A research agent that cannot resume after a timeout is brittle, not autonomous. A support or legal workflow that cannot survive approvals, waiting periods, or tool outages is not production-ready no matter how eloquent the underlying model sounds in a demo. Durability, then, is the runtime expression of seriousness, not extra credit.

The software factory needs an operating system

Figure 06.2: The human control plane

Meridian's case from Chapters 3 and 4 becomes even more revealing here. The team already rebuilt its repo after the slop era, and after the admin-override regression it added the eval slices that now catch the special-path mistakes a casual test run would miss. Small delegated work now goes well. Then the team raises the ambition again. Instead of isolated patches, it asks the system to investigate a bug, spawn a few subagents, inspect a cluster of files, propose a fix, run checks, and prepare a reviewable summary for a human.

This is where a second class of problems appears. One subagent finds the relevant failure but another, working off a slightly older branch of understanding, proposes a different patch. A retry of the validation step reruns something the first attempt had already completed. The human reviewer receives fragments of work rather than a coherent roll-up. The system still has intelligence and context, but no stable execution semantics. It is a workshop full of talented workers without a foreman’s board, without station history, and without a clean shift handoff.

That is the deeper meaning of the software-factory metaphor. A factory is not only a prepared environment and a quality system. It also needs an operating system. It needs durable task identities, queues, checkpoints, resumability, visibility, and clear places for review. Otherwise increasing the number of workers only multiplies confusion.

This is why the runtime chapter naturally belongs beside the harness chapter rather than floating off into platform taxonomy. Harnesses without runtime semantics are fragile. The repo may be legible, the tasks may be specified, the standards may be measured, and the context may be well assembled. But if the work itself cannot persist and be supervised, the colleague illusion still breaks the first time the system has to keep going after the first clever turn.

Agentic systems are workflows with state

Figure 06.3: Subagents need recomposition

A lot of debate about agents versus workflows turns out to be a category error. People sometimes speak as if workflows are rigid and agents are flexible, so choosing workflows means giving up on real agency. In production systems, the opposite lesson often emerges. Workflow structure is what makes useful agency survivable.

Somal gives the chapter its best backbone here: “At the core of agentic AI applications is a complicated workflow... [that] needs to handle state potentially over long periods of time. There needs to be human interaction for approvals...” That sentence should kill the fake dichotomy. The system does not become less agentic because it has durable workflow semantics. It becomes more usable.

Useful agentic systems are not free-floating intelligence. They are stateful workflows with probabilistic decision nodes.

That framing clarifies a lot at once. It explains why pause and resume matter. It explains why retries should not live in ad hoc prompt logic. It explains why approvals belong naturally inside execution rather than as awkward afterthoughts. It explains why application state cannot be reduced to whatever is still visible in the prompt window. And it explains why runtime tooling increasingly looks closer to distributed-systems infrastructure than to prompt folklore.

In a serious coding workflow, state may include the current task plan, completed tool runs, validation status, pending questions for the human reviewer, and links to specific artifacts the agent produced. In a high-stakes professional workflow, it may include evidence bundles, validation checkpoints, unresolved exceptions, approval boundaries, and which output is ready for expert sign-off. In both cases, the core requirement is the same: the agent needs a structured memory of work, not merely a growing transcript of conversation. Durability lets the system preserve the difference between “what was said” and “what has happened.”
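The difference between a transcript and a structured memory of work can be sketched in a few lines. This is a minimal illustration, not a prescribed schema: `TaskState`, `ToolRun`, and every field name are hypothetical, chosen to mirror the state items listed above, and JSON stands in for whatever durable store a real runtime would use.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ToolRun:
    tool: str
    status: str                      # "ok" or "failed"
    artifact: Optional[str] = None   # path or link to what the run produced


@dataclass
class TaskState:
    """Hypothetical record of what has happened, independent of any chat log."""
    task_id: str
    plan: List[str]
    completed_runs: List[ToolRun] = field(default_factory=list)
    validation_status: str = "not_run"
    pending_questions: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict recurses into nested dataclasses, so each ToolRun becomes a dict.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "TaskState":
        data = json.loads(raw)
        data["completed_runs"] = [ToolRun(**run) for run in data["completed_runs"]]
        return cls(**data)


state = TaskState(task_id="bug-1", plan=["reproduce", "patch", "validate"])
state.completed_runs.append(ToolRun("run_tests", "ok", "artifacts/test-report.txt"))
state.pending_questions.append("Is the admin-override path in scope?")

# Simulate an interruption: persist the state, then restore it from storage.
restored = TaskState.from_json(state.to_json())
```

Nothing in `restored` depends on the prompt window surviving. The plan, the completed runs, and the open question for the reviewer come back as data the runtime can act on.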

History is part of execution, not just debugging

Figure 06.4: Agency is a dial, not a switch

Once you start thinking in workflows rather than turns, history changes meaning, and a practical rule follows from the change: persist a structured record of what has happened, not a growing transcript of what was said. In chat systems, history is mostly there to help the next answer feel continuous. In durable systems, history is part of execution itself. It tells the runtime what has already happened, which steps can be retried safely, which approvals were granted, what state changed, and where the agent should resume — none of which a transcript, by itself, can represent.

That is why durable-agent discussions keep converging on structured histories, checkpoints, and replayable event logs. Not because engineers enjoy complexity, but because long-running work creates obligations. If the system did something important, someone may later need to inspect it. If a run failed halfway through, the team may need to resume from a meaningful boundary rather than start from zero. If a result is contested, the organization may need to know what the system saw, which tools it used, and which step introduced the mistake.
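One way to see history acting as part of execution is a step runner that consults the log before doing anything. The sketch below is an assumption-laden toy: `run_step`, the event shapes, and the in-memory list are all hypothetical, and a production runtime would write to durable storage instead.

```python
from typing import Any, Callable, Dict, List


def run_step(history: List[Dict[str, Any]], step_id: str, action: Callable[[], Any]) -> Any:
    """Run `action` unless the history already records this step as completed."""
    for event in history:
        if event["type"] == "step_completed" and event["step_id"] == step_id:
            return event["result"]          # retry: reuse the recorded outcome
    history.append({"type": "step_started", "step_id": step_id})
    result = action()
    history.append({"type": "step_completed", "step_id": step_id, "result": result})
    return result


history: List[Dict[str, Any]] = []
calls = {"count": 0}


def run_validation() -> str:
    calls["count"] += 1                     # stands in for an expensive side effect
    return "checks passed"


first = run_step(history, "validate", run_validation)
second = run_step(history, "validate", run_validation)   # a retry of the same step
```

The retry returns the recorded result and the validation runs only once. The sketch does not cover a crash between the action and the `step_completed` write; closing that window usually requires the side effect itself to be idempotent, for example by carrying the step identifier as an idempotency key.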

Somal makes this visibility requirement explicit: “We also store all of the workflow history... so that you can look at the visibility of what is happening as your agent is navigating this complex set of interactions.” History is the substrate of inspection, not archival fluff.

This is also where runtime design begins to touch Chapter 4’s control-system argument. A good history lets a team do more than recover execution. It lets them learn. Failed trajectories become eval cases. Slow steps become optimization targets. Repeated approval bottlenecks reveal design problems in the control plane. The runtime is not merely keeping the work alive but generating the evidence by which the system can later improve.

Replay, snapshot, and the shape of continuity

Once durability becomes a real concern, a more technical tradeoff appears: how exactly should continuity be represented? One family of systems leans on replay. Preserve an event history, then reconstruct state from what happened. Another family leans on snapshots. Save checkpoints of working state so execution can continue more directly. Both approaches are reasonable. Both reveal something about what the team values.

Replay-oriented designs are attractive when causality and auditability matter. They preserve a strong sense of how the system got here. They make it easier to reason about the chain of events. They fit environments where exact reconstruction is important and where state should emerge from recorded steps rather than from opaque frozen blobs.

Snapshot-oriented designs are attractive when fast continuation and complex live state matter more. They reduce the cost of resuming. They can feel more natural when the system’s working memory is elaborate, when rebuilding everything is awkward, or when pause-and-resume is expected to be frequent.

This is less a taxonomy lesson than a decision with a rule inside it: reach for replay when causality and auditability are the point, so state emerges from recorded steps rather than opaque blobs; reach for snapshots when fast continuation and elaborate live state dominate, so pause and resume stay cheap. The existence of that tradeoff is what proves runtime semantics are not incidental details. Once agents operate over time, teams are making the kinds of decisions mature distributed systems always have to make: what gets persisted, what gets recomputed, what must be auditable, what can be resumed cheaply, and which failure modes are acceptable.
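The two families can be compared on a toy event log. The sketch assumes a hypothetical state shape (`done`, `approved`) and two invented event types; it shows that replaying the full history and resuming from a snapshot plus the remaining events arrive at the same state, with different costs and different audit properties.

```python
import copy
from typing import Any, Dict, List, Optional

State = Dict[str, Any]
Event = Dict[str, Any]


def apply(state: State, event: Event) -> State:
    """Pure transition: return a new state with one event applied."""
    new_state = copy.deepcopy(state)
    if event["type"] == "step_completed":
        new_state["done"].append(event["step"])
    elif event["type"] == "approval_granted":
        new_state["approved"] = True
    return new_state


def replay(events: List[Event], start: Optional[State] = None) -> State:
    """Rebuild state by folding events over an initial (or snapshotted) state."""
    state = copy.deepcopy(start) if start is not None else {"done": [], "approved": False}
    for event in events:
        state = apply(state, event)
    return state


events = [
    {"type": "step_completed", "step": "reproduce"},
    {"type": "step_completed", "step": "patch"},
    {"type": "approval_granted"},
    {"type": "step_completed", "step": "validate"},
]

# Replay-oriented: state emerges from every recorded step.
full = replay(events)

# Snapshot-oriented: checkpoint after two events, then continue from there.
snapshot = {"at": 2, "state": replay(events[:2])}
resumed = replay(events[snapshot["at"]:], start=snapshot["state"])
```

Replay keeps the whole causal chain inspectable but pays for it on every resume. The snapshot path is cheaper to continue from, at the price of trusting a saved blob.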

The human control plane is an architectural layer

This is where the chapter’s title concept should crystallize. A recurring mistake in agent discourse is to treat human involvement as a temporary crutch on the way to full autonomy. But the more capable systems become, the less persuasive that framing looks. In valuable systems, human control is not a leftover from immaturity. It is an architectural layer.

The human control plane is the set of interfaces, approvals, visibility layers, and intervention mechanisms through which people supervise delegated machine work. It is the place where a person can see what is active, understand what happened, inspect evidence, redirect a task, approve a risky transition, or teach the system something reusable.

That means a chat transcript is usually not enough. Operators need queue views, roll-up summaries, pending-review surfaces, uncertainty cues, state inspection, and clean intervention points. They need something closer to a control room than a message thread. If the only way to supervise a complex agent system is to read back through thousands of tokens and manually reconstruct what happened, then the control plane does not exist yet.

Eric Zakariasson’s line is one of the cleanest expressions of the problem: “Here’s what everyone is working on... and here’s what you as a human need to review.” That is the control plane in plain English. Not omniscient micromanagement, but selective visibility into a fleet of delegated work.

Maggie Appleton sharpens the same point from the collaboration angle. The missing thing is not merely more autonomous workers but a shared space in which plans, context, intermediate work, and review can be coordinated collectively. The challenge is no longer only model reasoning but organizational legibility. The control plane ties together execution, observability, oversight, and team coordination under one idea: make supervision operationally cheap enough that humans can stay above the loop without vanishing from it.

Human control is not human micromanagement

The phrase human-in-the-loop can accidentally trivialize the design problem. It can suggest a binary choice: either humans approve everything, or the system is autonomous. The more useful reality is a gradient of control, and it turns the design task into a placement question — not how to keep a human in every loop, but where to put the few checkpoints that carry the most judgment. Humans may stay out of the way for low-risk steps, review plans before expensive ones, approve external actions, inspect only exceptions, or intervene only when uncertainty spikes. Control can sit before, during, and after execution.

A well-designed control plane should reduce the need for constant rescue, not institutionalize it. The goal is not to make every system depend on manual babysitting. The goal is to create high-leverage checkpoints where human judgment matters most.

A coding factory, for example, might let subagents explore, search, summarize, draft, and run validations autonomously, while reserving merge decisions, large architectural changes, or dependency additions for review. A high-stakes professional workflow might allow autonomous evidence gathering and draft assembly, while requiring expert sign-off before client-facing output or consequential recommendations. In both cases, the right design question is not “How do we keep the human involved everywhere?” It is “Where is the human most valuable?” That is a control-plane question, not a prompt question.
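The placement question can be written down as a policy table rather than left to prompt wording. The following is a sketch under stated assumptions: the action names echo the coding-factory example above, while the `Gate` enum, the `POLICY` mapping, and the choice to send unknown actions to review are illustrative decisions, not something the sources prescribe.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Gate(Enum):
    AUTO = "auto"        # the agent may proceed without a human
    REVIEW = "review"    # the step waits for human sign-off


# Hypothetical policy for a coding workflow.
POLICY: Dict[str, Gate] = {
    "search": Gate.AUTO,
    "summarize": Gate.AUTO,
    "draft": Gate.AUTO,
    "run_validation": Gate.AUTO,
    "merge": Gate.REVIEW,
    "add_dependency": Gate.REVIEW,
    "architecture_change": Gate.REVIEW,
}


def gate_for(action: str) -> Gate:
    # Assumption: anything the policy does not name is treated as risky.
    return POLICY.get(action, Gate.REVIEW)


def partition(actions: List[str]) -> Tuple[List[str], List[str]]:
    """Split planned actions into those that run freely and those that wait."""
    auto = [a for a in actions if gate_for(a) is Gate.AUTO]
    review = [a for a in actions if gate_for(a) is Gate.REVIEW]
    return auto, review


auto_steps, review_steps = partition(["search", "merge", "draft", "deploy_to_prod"])
```

Because the table is data, tightening or loosening autonomy for a given use case is a reviewed configuration change, not a prompt edit.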

Attention is not the only scarce resource the control plane rations. Compute is the other: match the cost of the response to the difficulty of the request rather than paying frontier prices per step. Laurie Voss at Arize states it almost prescriptively — use “cheap models for simple queries and expensive models in your agent ... for complex queries” — and Harrison Chase at LangChain describes a router whose job is to “route between ... language models.” The platforms expose the same trade as service tiers; Guillaume Vernade at Google DeepMind describes a flex tier that gives “a 50% discount but your request can be ... delayed.” Which model runs a given step is a control-plane decision, not a global default chosen once. Routing adds its own failure surface, though — a misroute hands a hard task to a cheap model that quietly botches it — so aggressive routing is safe only behind the verification the control plane already runs. Route down to the cheapest model that still passes the eval, and no cheaper.
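"Route down to the cheapest model that still passes the eval" can be sketched as a verified cascade. Note the substitution: this is a try-cheap-then-escalate cascade, not the up-front classifier style of router, and the model functions and `passes_eval` check below are stand-in stubs, not real model or eval APIs.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]


def route(task: str, tiers: List[Tuple[str, Model]],
          passes_eval: Callable[[str, str], bool]) -> Tuple[str, str]:
    """Try tiers cheapest first; return the first output that passes verification."""
    for name, model in tiers:
        output = model(task)
        if passes_eval(task, output):
            return name, output
    raise RuntimeError("no tier passed verification; escalate to a human")


# Stubs for illustration only: the cheap model quietly botches longer tasks.
def cheap_model(task: str) -> str:
    return task.upper() if len(task) < 10 else "???"


def frontier_model(task: str) -> str:
    return task.upper()


def passes_eval(task: str, output: str) -> bool:
    return output == task.upper()


tiers = [("cheap", cheap_model), ("frontier", frontier_model)]
easy = route("short", tiers, passes_eval)
hard = route("a much longer request", tiers, passes_eval)
```

The misroute risk described above shows up here as the cheap tier's bad output, and the verification step is what keeps it from reaching the user. A cascade pays for the failed cheap attempt, so it only saves money when most tasks pass at the lower tier.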

High-stakes systems tune agency instead of maximizing it

The High-Stakes Colleague case makes this point unavoidable. In legal, tax, compliance, healthcare, and similar workflows, the dream of unrestricted autonomy becomes less impressive the closer you get to real operational risk. The system is valuable not because it can do everything without supervision, but because it can do the right things with the right boundaries.

Joel Hron offers the right antidote to autonomy maximalism, arguing that agency is best thought of as a spectrum — a set of dials adjusted by use case. That framing matters because it replaces the childish question — how autonomous can we make it? — with the adult one: where should autonomy be high, where should it be low, and who decides? That difference is foundational to trustworthy product design.

The north star, as Hron puts it, has shifted “from helpfulness to productive.” But productive does not mean unsupervised. In high-stakes work, productivity often depends on carefully staged authority. The system may be allowed to gather evidence, route across tools, synthesize findings, and even validate parts of its own work. But certain boundaries remain deliberately human. An approval is not evidence that the system failed. It is evidence that the organization understands where risk actually lives.

This is another reason Chapter 6 should pair the Coding Factory with the High-Stakes Colleague. The same control-plane principle appears in both, even though the surface domain is different. In software, a human may review the patch before merge. In professional services, a human may review the trajectory before the conclusion is accepted. In both cases, adjustable autonomy is the runtime expression of trust.

Legacy systems become runtime components

One of the most practical ideas in the High-Stakes Colleague material is that old systems are not just obstacles to agentic work. They often become runtime components of the new control plane.

Hron points out that existing validation engines can be repurposed as tools the AI system uses to inspect and correct its own work — in the same firm that, in Chapter 5, learned to rank a matter note above a public explainer, provenance and validation now become tools the system itself calls. That is a powerful pattern because it shows how durable execution changes the role of traditional enterprise software. Systems that once only served human operators now become structured checkpoints, rule engines, and verification layers inside a machine-mediated workflow.

The chapter should linger on this because it helps demystify agent architecture. Not every trustworthy agent system is built from scratch as a magical new organism. Often it is assembled from older, more stable parts: permission systems, validators, databases, workflow engines, audit trails, search layers, review queues. The model is the volatile component. The rest of the runtime is what prevents volatility from becoming operational chaos.

That is also why runtime design is inseparable from organizational design. As soon as an agent can call the old validation engine, write into the old workflow record, and surface outputs to the old reviewer queue, the boundary between “AI system” and “business process” starts to collapse. The runtime becomes the place where those worlds meet.

Observability is part of the control plane

None of this works if the system is opaque. Classic monitoring tells you whether a service is up, slow, or erroring. Agent observability has to answer a different kind of question: what did the system believe it was doing, what did it actually do, where did it drift, and what should a human now inspect?

Observability is not merely a nicer logging story but what makes the control plane real. Humans cannot steer what they cannot see.

Good agent traces capture plans, tool calls, state changes, intermediate outputs, timings, and boundaries between durable steps. They should support two levels at once: deep inspection of a single trajectory and roll-up supervision across many concurrent tasks. The first helps engineers debug strange failures. The second helps operators manage a fleet.
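The two levels can share one data model. In this sketch the `Span` record and its fields are hypothetical, loosely following the items listed above; `trajectory` gives the deep view of one task and `roll_up` gives the fleet view an operator would scan.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    task_id: str
    kind: str          # e.g. "plan", "tool_call", "state_change", "output"
    name: str
    duration_ms: int
    ok: bool = True


def trajectory(spans: List[Span], task_id: str) -> List[Span]:
    """Deep inspection: every recorded step of a single task, in order."""
    return [s for s in spans if s.task_id == task_id]


def roll_up(spans: List[Span]) -> Dict[str, Dict[str, object]]:
    """Fleet supervision: one summary row per task."""
    summary: Dict[str, Dict[str, object]] = {}
    for s in spans:
        row = summary.setdefault(
            s.task_id, {"steps": 0, "total_ms": 0, "needs_review": False}
        )
        row["steps"] += 1
        row["total_ms"] += s.duration_ms
        row["needs_review"] = row["needs_review"] or not s.ok
    return summary


spans = [
    Span("task-a", "plan", "outline_fix", 120),
    Span("task-a", "tool_call", "run_tests", 900),
    Span("task-b", "tool_call", "apply_patch", 300, ok=False),
]
fleet = roll_up(spans)
detail = trajectory(spans, "task-a")
```

An engineer debugging a strange failure reads `detail`. An operator managing many concurrent tasks reads `fleet` and opens only the rows flagged for review.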

This is where Chapter 4’s line from Phil Hetzel keeps paying off: observability and eval are often the same problem from a systems perspective. In Chapter 6 that claim becomes more concrete. The runtime records the trajectory. Observability renders it legible. The control plane decides where humans inspect it. Evals later mine it for reusable lessons. One layer feeds the next.

There is also an honest tension here the chapter should keep visible. Richer traces increase trust, debuggability, and governance capacity. They also increase privacy, retention, and security risk. The answer is not to avoid observability but to design it consciously: redaction, selective retention, risk-based views, and different surfaces for debugging versus audit. Even here, the control plane is doing governance work.
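Designing observability consciously can start with separate projections of the same trace record. The sketch below makes several assumptions loudly: the record fields are invented, the only redaction rule is a simple email pattern, and the split (audit keeps step metadata without payloads, debug keeps a redacted payload) is one possible policy, not a recommendation from the sources.

```python
import re
from typing import Dict

# Deliberately simple pattern for illustration; real redaction needs broader rules.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def view(record: Dict[str, str], surface: str) -> Dict[str, str]:
    """Project one trace record onto a named surface."""
    if surface == "audit":
        # Assumption: auditors need what happened, not the raw content.
        return {key: record[key] for key in ("step", "tool", "outcome")}
    if surface == "debug":
        # Assumption: engineers need the content, with personal data masked.
        return {**record, "payload": redact(record["payload"])}
    raise ValueError(f"unknown surface: {surface!r}")


record = {
    "step": "draft_reply",
    "tool": "crm_lookup",
    "outcome": "ok",
    "payload": "Contact jane@example.com about the filing",
}
audit_view = view(record, "audit")
debug_view = view(record, "debug")
```

Selective retention would follow the same shape: the audit projection can be kept for a long period while the payload-bearing debug projection expires sooner.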

Parallel workers create leverage only if work can be recomposed

The final runtime lesson is about subagents. Parallel workers are compelling because they offer the same thing every manager has wanted forever: more throughput. OpenAI’s subagent materials and the coding-factory case both point toward a future where one human can launch many narrow specialists at once. Searcher, implementer, reviewer, summarizer, debugger, policy checker, migration scout. The leverage is real.

But subagents do not solve the control problem. They intensify it.

More workers mean more intermediate artifacts, more opportunities for duplicated effort, more state to coordinate, and more need for roll-up visibility. Parallelism without recomposition is just chaos at higher speed. The key design challenge is not how to spawn more workers but how to merge, compare, inspect, and route their outputs so that the human remains oriented.
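Recomposition can be made concrete with a small merge step that sorts subagent output before a human sees it. Everything named here is hypothetical (`Proposal`, `recompose`, the revision labels); the sketch flags proposals built on an outdated revision, finds proposals that touch the same files, and reports what is ready to roll up.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Proposal:
    agent: str
    base_revision: str          # the repo revision the agent worked from
    files: FrozenSet[str]       # files the proposal touches
    summary: str


def recompose(proposals: List[Proposal], current_revision: str) -> Dict[str, list]:
    """Classify subagent output as stale, conflicting, or ready for review."""
    fresh = [p for p in proposals if p.base_revision == current_revision]
    stale = [p.agent for p in proposals if p.base_revision != current_revision]
    conflicts = []
    for i, a in enumerate(fresh):
        for b in fresh[i + 1:]:
            overlap = a.files & b.files
            if overlap:
                conflicts.append((a.agent, b.agent, sorted(overlap)))
    contested = {agent for a, b, _ in conflicts for agent in (a, b)}
    ready = [p.agent for p in fresh if p.agent not in contested]
    return {"stale": stale, "conflicts": conflicts, "ready": ready}


proposals = [
    Proposal("searcher", "rev-42", frozenset({"auth.py"}), "found failing branch"),
    Proposal("implementer", "rev-42", frozenset({"auth.py", "session.py"}), "patch"),
    Proposal("reviewer", "rev-41", frozenset({"docs.md"}), "notes on old revision"),
    Proposal("summarizer", "rev-42", frozenset({"notes.md"}), "roll-up draft"),
]
report = recompose(proposals, current_revision="rev-42")
```

The reviewer then receives three short lists, not a pile of fragments: work to rerun, work to reconcile, and work to read.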

Independence also has to be enforced by the environment, not merely declared in the task split. Several agents pointed at one shared dev setup collide on the same branch, ports, and database — one agent's migration breaks the others mid-run — so each worker needs its own isolated, ephemeral environment. Maggie Appleton at GitHub gives each session one “backed by a micro VM ... a sandboxed computer in the cloud on its own Git branch,” which is what lets a developer “work on parallel tasks and instantly switch between them.” And the obvious primitive is the wrong one: Rene Brandel at Casco warns that “if you just use containers ... that's not an isolation layer,” because agent code can get root and move laterally. A git worktree suffices for trusted edits; untrusted, side-effecting agent work wants a VM.

Lou Bichard at Ona sharpens the diagnosis to a single missing piece. The runtime, he argues, is solved — “there are many options for this now, sandboxes and containers” — and so are triggers and orchestration. “The thing that's missing,” he says, “is coordination”: the agent-native primitive that lets parallel workers pick up tasks, signal completion, and hand off without a human stitching them together. He is pointed about what is not that primitive: “GitHub is not a coordination layer for agents — it gets incredibly overwhelming.” His candidate building blocks are this chapter's subject — “state machines, by building out workflows.”
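The "state machines" building block can be illustrated with a task lifecycle that enforces claim, completion, and handoff. This is a single-process toy with invented state and event names, not the coordination primitive Bichard says is missing. A real coordinator would need the claim to be atomic in a shared durable store.

```python
from typing import Dict, Optional, Tuple

# Hypothetical lifecycle: (current state, event) -> next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("queued", "claim"): "in_progress",
    ("in_progress", "complete"): "awaiting_review",
    ("in_progress", "release"): "queued",
    ("awaiting_review", "approve"): "done",
    ("awaiting_review", "reject"): "queued",
}


class Task:
    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.state = "queued"
        self.owner: Optional[str] = None

    def fire(self, event: str, actor: str) -> str:
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"{event!r} is not allowed from state {self.state!r}")
        # Only the worker that claimed a task may complete or release it.
        if event in ("complete", "release") and actor != self.owner:
            raise PermissionError(f"{actor!r} does not own task {self.task_id!r}")
        if event == "claim":
            self.owner = actor
        self.state = TRANSITIONS[key]
        if self.state == "queued":
            self.owner = None       # returned to the pool for another worker
        return self.state


task = Task("fix-login")
task.fire("claim", "agent-a")
try:
    task.fire("claim", "agent-b")   # a second worker tries to take the same task
    double_claim_blocked = False
except ValueError:
    double_claim_blocked = True
task.fire("complete", "agent-a")
final_state = task.fire("approve", "reviewer")
```

Explicit transitions are what let workers hand off without a human stitching them together: an illegal move fails loudly and does not silently duplicate work.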

This is why the best visions of multi-agent work keep converging on planning boards, supervisor views, task decomposition layers, and explicit review queues. They are not administrative extras but the infrastructure that lets parallelism produce leverage rather than entropy.

The teams shipping production multi-agent systems have not agreed on an answer; each has substituted a known mechanism for the missing one. Factory runs features serially with one active writer — “serial execution with targeted internal parallelization” — eliminating the coordination problem by construction, and reports a longest mission of sixteen days. Anthropic's long-running agents take a planner-generator-evaluator path where each role gets “its own kind of context window” and the agents “negotiate what done actually means” through a contract written to files on disk before any code is produced. Serial execution, file-based contracts, state machines plus durable execution: three substitutes for one primitive that does not yet exist.

The Coding Factory shows the coding version of this. The High-Stakes Colleague shows the professional-services version. In both, the real question is the same: when many machine workers are active, where does coherent human judgment re-enter the system? That place is the control plane.

The runtime is what turns intelligence into dependable work

The real challenge of agentic systems is not producing one intelligent response but sustaining useful action across time without losing control. That is a runtime problem.

Durable state, explicit workflow semantics, structured approvals, inspectable histories, observability, and reviewable roll-ups are not secondary implementation details. They are the machinery that turns bursts of model intelligence into dependable delegated work. Without them, the system remains trapped in the demo layer: locally impressive, globally fragile.

This is the deeper continuity across the book’s middle run. Chapter 3 argued that delegated work needs a legible workplace. Chapter 4 argued that it needs a quality loop. Chapter 5 argued that it needs the right working set of information. Chapter 6 adds that none of this is enough if the work cannot persist, recover, and be supervised over time.

A machine colleague is not just a model with tools. It is a model inside an operating environment.

And the better that operating environment gets, the less the future of AI engineering looks like chat and the more it looks like building dependable systems for shared human-and-machine work.

Runtimes, State, and the Human Control Plane