From Copilot to Colleague

Chapter 09 · 22 min read

The AI-Native Organization

How teams and incentives change when AI becomes part of the workforce.


Most AI adoption stories are too small to be the real story. A few engineers get faster. Support summarizes tickets more quickly. A product manager drafts specs in half the time. Those are real gains, and they are also, every one of them, individual gains. Add them up and you have a company that uses AI. You do not yet have an AI-native organization.

The difference shows up on a Monday morning. Picture a company that has already moved past casual adoption. Over the weekend, agents opened pull requests. Product generated three new onboarding flows. Support drafted fixes for a backlog of tickets. Internal automations touched the billing system, the CRM, and the docs. Everyone — and everything — was productive. And the organization walks in on Monday to discover that its problem is no longer a shortage of output. It is a shortage of coherence.

That is the shift this chapter is about. AI does not simply make an organization faster. It moves where the scarce thing lives. When execution gets cheap, the bottleneck stops being production and becomes judgment, prioritization, review, and the design of throughput itself. The companies that win the next decade will not be the ones that bought the most seats. They will be the ones that redesigned the operating model so that cheap generation turns into trusted work instead of expensive noise.

AI-native is an operating model, not a purchase

Figure 09.1 / Where scarcity moves

The cleanest way to see the gap between "uses AI" and "AI-native" is to look at what happens at the threshold of full adoption. Dan Shipper, building the AI-native company Every, puts the discontinuity in numbers: "There is a 10x difference between an organization where 90% of engineers use AI versus one where 100% do." The 10x is a deliberate provocation, not a measured figure — the steepness is the point, and the more sober large-scale evidence comes a few paragraphs on. The last ten percent is not a rounding error. It is the difference between AI as a personal productivity tool and AI as a medium the whole organization works in.

The reason is that partial adoption keeps the old workflows intact. If nine in ten engineers use agents but the tenth does not, every process still has to accommodate the human-only path — the review that assumes a person wrote the code, the handoff that assumes a person is on the other end, the planning that assumes work moves at human speed. The workflow cannot be rebuilt around delegation until delegation is universal, which makes the holdout the constraint: as long as one path assumes human-speed work, the gain from the other ninety percent is capped at faster people rather than a faster company. The lever, then, is not buying the ninety-first seat but closing the last human-only path. At ninety percent, you have faster people inside an unchanged company. At a hundred, the company itself can change shape.

That reframing — from procurement to operating-model redesign — is what separates the field reports that sound transformational from the ones that sound like tool reviews. Barr Yaron's 2025 AI Engineering Report, surveying the state of practice across the industry, reads less like a list of which tools won and more like a map of which organizational habits are forming. The teams in the "from hype to habit" cohort — the ones building an AI-first company while still shipping a roadmap — describe the same arc: the early win is individual speed, and the durable win only arrives when the surrounding work is rebuilt to assume that speed.

This is why "we rolled out AI" and "we became AI-native" are different sentences with different price tags. The first is a budget line. The second is a redesign of how work is created, reviewed, and trusted — and it is the redesign, not the rollout, that produces the order-of-magnitude difference Shipper is pointing at.

Cheaper execution moves scarcity up the stack

Figure 09.2 / Alignment debt

The book has argued since its opening chapters that when code gets cheap, judgment gets expensive. Chapter 1 made it a claim about the shift from suggestion to delegation; Chapter 2 made it a claim about taste. At the organizational scale, it becomes a claim about where the bottleneck physically sits.

When a single engineer can direct several agents at once, the constraint stops being how fast the team can produce and starts being how fast the team can decide what is worth producing and confirm that what got produced is correct. Justin Reock, working on engineering leadership at DX, frames the leadership job in exactly these terms: the manager's role shifts from allocating production capacity — which is suddenly abundant — to allocating judgment and attention, which stay scarce. The practical test for a manager is to ask which scarce resource each ritual rations. A standup that reports how much got produced is rationing the abundant thing; a standup that surfaces which decisions are unmade and which output is waiting on review is rationing the scarce one. The org chart was built to ration the wrong resource — and the wrong choice is to keep optimizing throughput when the queue that is actually backing up is judgment.

Here the argument has to stay honest, because this is precisely the place where AI hype outruns the evidence. The large-scale studies are more sober than the demos. Yegor Denisov-Blanch's Stanford research, drawn from data on a hundred thousand–plus developers, finds that AI's productivity effect is real but uneven — strong in some task types and codebases, near zero or negative in others, and reliably overstated when teams measure activity instead of outcomes. Some AI-generated work creates rework that quietly eats the gain. The responsible reading is not "AI makes everyone faster." It is "AI changes the distribution of where time goes, and a naive measure will misread motion as progress."

That nuance matters for org design because the wrong metric, applied to cheap execution, actively destroys value. Nick Arcolano's analysis at Jellyfish, built on a dataset of some twenty million pull requests, shows the failure mode at scale: output volume rises, the activity dashboards light up green, and the actual constraint — whether the organization can review, integrate, and trust all that output — goes unmeasured until it breaks. The failure to name is the green dashboard: a count of PRs opened, commits, or lines generated is an activity metric, and in a world where artifacts are cheap, an activity metric measures motion, not progress. The fix is to instrument the outcome instead — rework rate, the share of generated work that ships unreverted, time spent in the review queue — so the metric tracks the resource that actually moved. The scarce resource moved; the measurement did not.
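As a minimal sketch of what instrumenting the outcome could look like, the Python below computes the three measures named above from a batch of merged changes. The `MergedChange` record and its fields are hypothetical, and the sketch assumes each change has already been labeled as reworked or reverted; deciding how those labels get assigned is the harder part and is not shown.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class MergedChange:
    """One merged change. Every field here is a hypothetical label."""
    ai_generated: bool         # produced mostly by an agent
    reverted: bool             # rolled back after shipping
    reworked: bool             # needed a follow-up fix soon after merge
    review_wait_hours: float   # time spent waiting in the review queue


def outcome_metrics(changes: list[MergedChange]) -> dict[str, float]:
    """Summarize outcomes, not activity, for a batch of merged changes."""
    if not changes:
        raise ValueError("need at least one change")
    generated = [c for c in changes if c.ai_generated]
    return {
        # Share of all changes that needed rework.
        "rework_rate": sum(c.reworked for c in changes) / len(changes),
        # Share of AI-generated changes that shipped and stayed shipped.
        "generated_unreverted_share": (
            sum(not c.reverted for c in generated) / len(generated)
            if generated
            else float("nan")
        ),
        # Typical time a change waited for review.
        "median_review_wait_hours": median(
            c.review_wait_hours for c in changes
        ),
    }
```

None of these numbers rises just because more pull requests were opened, which is the property an activity count lacks.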

The five levels of AI-native maturity

Figure 09.3 / Broaden who creates

The chapter so far has been arguing that AI-native is a destination, not a toggle. But organizations ask a more practical question: where are we now, and what does progress actually look like? Practitioners have accumulated enough field reports, engineering team case studies, and post-mortems from large-scale enterprise deployments to answer that question more precisely than "early versus late."

The pattern that emerges across the corpus is a five-level progression defined not by tooling choices but by what the organization has internalized. Each level has a characteristic bottleneck — the constraint that has to break before the next level becomes accessible.

L0: Prompt-era. AI is a personal tool used by some engineers for some tasks. There are no shared policies, no conventions, no measurement. The bottleneck is access and awareness: not everyone has the tools, and no one has thought about what to do with them collectively.

L1: Assisted. Most engineers have access and use AI routinely for code generation, summarization, and local productivity tasks. But the work product — the PR, the spec, the ticket — enters the same review process it always did. AI makes people faster inside an unchanged workflow. The bottleneck is the detection gap: the organization cannot tell which outputs were AI-assisted, which means it cannot calibrate review accordingly. Activity metrics rise; outcome metrics are unmeasured.

L2: Augmented. The organization has started to treat AI as a participant in the workflow rather than a personal accelerant. There are shared conventions, early evals, or some agreement about what AI-assisted work looks like before it enters review. The integration is fragile and uneven — some teams have it, others do not — but the bottleneck has shifted from access to consistency. The question changes from "do we use AI?" to "do we use it the same way?"

L3: AI-native. Delegation is universal; the last human-only path has been converted. The workflow has been redesigned around the assumption that execution is cheap: review is built as a system, not a heroic act; alignment happens before the agent fan-out; creation is open to non-engineers through hardened, tested paths. The bottleneck is no longer the workflow — it is the measurement. The organization knows what it is producing but still struggles to know whether it is producing the right things at the right quality.

L4: AI-first. Institutional judgment has been externalized — packaged into specs, evals, review gates, and governance policies that are available to agents as well as humans. The operating model itself is versioned and improvable. The organization treats its own operating model the way a software team treats a codebase: something to be maintained, extended, and refactored rather than accumulated and forgotten. This level is rare enough that most field reports describe the path toward it rather than operations at it.

The levels are useful not as a ranking but as a diagnostic. An organization at L2 that believes it is at L3 will keep trying to solve a consistency problem with throughput solutions — more models, more seats, faster generation — and get worse, not better. The bottleneck at each level requires a different kind of work to break: policy and access at L0, measurement and calibration at L1, standardization at L2, redesign at L3, externalization at L4. Misreading your level is how you spend a year on the wrong problem.
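The diagnostic use of the levels can be sketched in a few lines of Python. The mapping from level to the kind of work that breaks its bottleneck is taken from the paragraph above; the `diagnose` function and the idea of answering four yes/no milestone questions in order are illustrative assumptions, not a validated assessment instrument.

```python
from enum import IntEnum


class Level(IntEnum):
    PROMPT_ERA = 0
    ASSISTED = 1
    AUGMENTED = 2
    AI_NATIVE = 3
    AI_FIRST = 4


# Kind of work needed to break the bottleneck at each level (from the text).
WORK_TO_BREAK = {
    Level.PROMPT_ERA: "policy and access",
    Level.ASSISTED: "measurement and calibration",
    Level.AUGMENTED: "standardization",
    Level.AI_NATIVE: "redesign",
    Level.AI_FIRST: "externalization",
}


def diagnose(milestones_met: list[bool]) -> Level:
    """Return the level as the count of consecutive milestones met.

    Hypothetical milestones, in order:
      0. most engineers use AI routinely
      1. shared conventions or early evals exist
      2. no human-only path remains and review runs as a system
      3. judgment is packaged into versioned specs, evals, and gates
    A later milestone does not count if an earlier one is unmet.
    """
    level = 0
    for met in milestones_met[: len(Level) - 1]:
        if not met:
            break
        level += 1
    return Level(level)


answers = [True, True, False, True]  # claims milestone 3 but not 2
level = diagnose(answers)
next_work = WORK_TO_BREAK[level]
```

Stopping at the first unmet milestone is the design choice that matters here: an organization cannot skip the consistency work at L2 by pointing to something it has built further up.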

What AI-native looks like at different scales

Figure 09.4 / Review is the new bottleneck

One of the persistent confusions in the field is that AI-native looks the same regardless of the size of the organization. It does not. A fifteen-person team building an AI product and a thousand-person enterprise deploying AI internally are both aiming for L3 — but the blockers, the failure modes, and the order of operations are almost entirely different.

For small teams — the seed-stage company or the five-to-fifteen-person product group — the biggest risk is not ungoverned sprawl. It is the false plateau. At this scale, individuals are aligned by proximity: there is no alignment debt because everyone sits in the same room or the same Slack channel and shares context without needing a formal system to carry it. The trap is that this ambient alignment disappears exactly when the team needs to hire, hand off, or scale. The teams that become AI-native at seed scale are the ones that write things down early — specs, decision logs, conventions — not because anyone would otherwise get lost today, but because they are building the harness that the next ten hires will work inside. The bottleneck is not today's efficiency but tomorrow's onboarding.

For scale-stage companies — the fifty-to-two-hundred-person organization that has been through its first growth cycle — the dominant failure mode is silo proliferation. Every team builds its own AI tooling; every tool generates its own context; no context is shared. Organizations at this stage often discover a dozen different internal AI setups, each locally optimized and globally incompatible. The Bloomberg engineering organization, describing its AI deployment across a large and highly specialized technical workforce, found the hard part was not model quality but scale itself: the productivity gains that showed up in greenfield work dropped off sharply against hundreds of millions of lines of existing code, and getting past the early adopters meant treating rollout as an organizational program — onboarding and shared principles acting as the change agent — rather than a tooling decision. The lever at this scale is an explicit internal platform investment: a common layer through which AI capability is provided, measured, and governed. Without it, every L2 win stays local and the organization cannot compound.

For enterprise — the organization with hundreds or thousands of engineers, often operating in a regulated industry, often with technology debt measured in decades — the maturity journey has a prerequisite that the smaller-company conversation routinely omits: legal and compliance as a design input, not a later constraint. An enterprise cannot grant agent access without knowing what the agent can see, what it can do, and who approved both. The organizations that are succeeding with AI at enterprise scale describe getting governance architecture right as the enabling move — not the concluding one. The pilot that lives in a sandbox indefinitely is usually not waiting for a better model. It is waiting for someone to design the credential scope and the audit trail that would allow it to move to production.

The insight that cuts across all three scales is that the phase transitions are always governance events, not capability events. An organization does not move from L2 to L3 because the models got better. It moves because someone made a structural decision: hardened a path, wrote down a convention, built a review system. The capability was available one level below. What unlocks the next level is always an organizational act.

Broader creation, narrower paths to ship

If execution is cheap, the natural move is to let more people execute. This is one of the most radical organizational consequences in the corpus, and Lisa Orr at Zapier states it as a provocation: "Your support team should ship code." Not file tickets for engineers to ship code — ship it themselves. When the cost of a competent first implementation collapses, the historical reason for routing all code through a narrow guild of engineers weakens with it.

But the provocation only works with its other half. Broader creation does not mean a free-for-all; it means widening who can start work while narrowing and hardening the path by which work ships. The support engineer can open the pull request. What protects the company is that the path from that pull request to production runs through the same tests, the same evals, the same review gates, the same permission boundaries that any change runs through. Democratized creation and stronger governance are not opposites. They are the two things that have to rise together, or the first one becomes a liability.

This is why roles are blurring and new ones are appearing at the same time. James Lowe's argument that every product needs an AI product manager — and that it should be you — is really an argument that someone has to own the judgment layer that cheap creation now demands: deciding which of the many possible artifacts is worth shipping, and shaping the constraints under which non-specialists create safely. Denys Linkov's work on structuring a modern AI team points the same direction: the team that ships dependable AI is not a bigger version of the old team. It mixes capabilities that used to live in separate departments, because the unit of work now crosses those departments by default.

The pressure reaches compensation and hiring too. When output is no longer a clean proxy for contribution, the old pay structures wobble — Arman Hezarkhani's proposal to pay engineers more like salespeople, on outcomes rather than effort, is one early attempt to re-anchor reward to value in a world where effort is cheap. And hiring inherits a strange new problem that Beth Glenfield names directly: when everyone interviews with AI, the old signals of competence stop discriminating, and the organization has to learn to hire for judgment and taste rather than for the ability to produce code that an agent can now produce for anyone.

Review becomes the bottleneck

Follow the cheap-execution argument one step further and it lands on a single organizational chokepoint. If one person can direct many agents, and more people can now create, then the total volume of work produced rises far faster than the human capacity to review it. Review — not generation — becomes the binding constraint on how fast the organization can actually move.

This is the structural fact behind the Monday-morning scene. The weekend produced a pile of pull requests, drafts, and automations. Every one of them needs a human judgment somewhere before the organization can trust it. And the supply of trustworthy human judgment did not increase over the weekend. Maggie Appleton, describing collaborative AI engineering from inside GitHub, captures the resulting pathology precisely: going fast without good alignment leads to wasted work, duplicate effort, and giant review queues with little context. The queue is where ungoverned speed goes to die.

The instinct to handle this by asking humans to review harder does not scale, because it asks the scarce resource to absorb the growth of the abundant one. The organizations that cope build the review function the way Chapter 4 argued they should build evals: as a system, not a heroic individual act. Layered validation, where automated checks and evals clear the routine cases so human attention concentrates on the consequential ones. Triage rules that decide what a human must see and what can ship on green. Roll-up visibility of the kind Eric Zakariasson describes in Cursor's software-factory work — a single surface that shows what every agent is doing and, crucially, what the human actually needs to look at — rather than a firehose of individual agent chatter.
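One way to picture layered validation and triage rules is the Python sketch below. The `PendingChange` fields, the 50-line threshold, and the notion of a "sensitive path" are all hypothetical placeholders; a real policy would come from the organization's own risk model, and this is not a description of how Cursor or any other tool implements it.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    BLOCK = "block"                  # failed automation; back to the author
    SHIP_ON_GREEN = "ship_on_green"  # routine case cleared by automation
    HUMAN_REVIEW = "human_review"    # consequential; needs a person


@dataclass(frozen=True)
class PendingChange:
    """A change waiting to ship. Fields are illustrative assumptions."""
    checks_passed: bool           # tests, linters, type checks
    evals_passed: bool            # behavioral evals, where they apply
    touches_sensitive_path: bool  # e.g. billing or permissions
    lines_changed: int


def triage(change: PendingChange, max_auto_lines: int = 50) -> Route:
    """Layer 1: automation gates everything. Layer 2: rules pick what a human sees."""
    if not (change.checks_passed and change.evals_passed):
        return Route.BLOCK
    if change.touches_sensitive_path or change.lines_changed > max_auto_lines:
        return Route.HUMAN_REVIEW
    return Route.SHIP_ON_GREEN


def human_queue(changes: list[PendingChange]) -> list[PendingChange]:
    """Roll-up view: only the changes a person actually needs to look at."""
    return [c for c in changes if triage(c) is Route.HUMAN_REVIEW]
```

The point of `human_queue` is that the reviewer's surface is derived from the rules rather than from the raw stream of agent output, so tightening or loosening the rules is how the organization tunes its review load.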

The connection back to Chapter 4 is direct and load-bearing: in an AI-native organization, eval and review capacity is not a quality-assurance afterthought. It is the throughput limit of the entire company. You can only safely create as fast as you can trustworthily review.

Alignment debt is the new invisible tax

There is a subtler failure than an overflowing review queue, and it is the one most likely to catch good teams by surprise. When individuals each direct their own fleet of agents in private, every workflow can be locally efficient and the whole can still be globally incoherent. Two engineers solve the same problem two different ways. A feature gets built that quietly conflicts with another team's assumptions. Work is duplicated, contradicted, or wasted — not because anyone was careless, but because alignment never happened.

Maggie Appleton names the root cause with unusual precision: "None of our current tools give teams a shared space to discuss plans, gather the right context, and work with agents as a collective." The tooling optimized the individual loop — one developer, many agents — and left the collective loop unbuilt. Each person's context lives in their own session. The plans never meet until the pull requests collide.

This is worth treating as a distinct organizational liability, and "alignment debt" is a useful name for it. Like technical debt, it accrues invisibly while things feel fast, and it comes due all at once — as the duplicated work, the conflicting implementations, the surprise feature nobody coordinated, the giant unmergeable pile. And like technical debt, the cure is not to slow down but to pay alignment earlier: to move shared planning, context-gathering, and visible work decomposition upstream of the agent fan-out, so that two dozen agents are working from one understood plan rather than two dozen private ones. The operational rule is that the plan, not the pull request, is where two people's work should first meet. If the first time two engineers' agent work touches is at the merge — the point Appleton diagnoses, where the plans never meet until the pull requests collide — alignment was already skipped, and the only question left is how expensive the collision is.
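To make "the plan is where work first meets" concrete, here is a small Python sketch that compares declared plans before any agent starts. It assumes, hypothetically, that each person registers the components their agents intend to touch; the names and component labels are invented for illustration, and overlapping scope is only a rough proxy for conflicting intent.

```python
from itertools import combinations


def plan_conflicts(
    plans: dict[str, set[str]],
) -> list[tuple[str, str, set[str]]]:
    """Return pairs of owners whose declared scopes overlap.

    `plans` maps an owner to the set of components their agents will touch.
    Each result is (owner_a, owner_b, shared_components), in sorted owner order.
    """
    conflicts = []
    for a, b in combinations(sorted(plans), 2):
        shared = plans[a] & plans[b]
        if shared:
            conflicts.append((a, b, shared))
    return conflicts


declared = {
    "dana": {"billing/invoice", "crm/sync"},
    "lee": {"crm/sync", "docs/onboarding"},
    "sam": {"support/macros"},
}
overlaps = plan_conflicts(declared)
```

In this example the overlap between the first two plans surfaces while it is still a conversation, not a merge conflict. The check is deliberately cheap: it costs one shared registry of intent, paid before the fan-out.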

The deeper point is that as execution fans out, alignment has to move in the opposite direction — it has to concentrate and move earlier. The cheaper it is to start work, the more expensive it becomes to have started the wrong work in parallel twenty times. Alignment debt is the tax an organization pays for treating a collective activity as a collection of private ones.

Governance as the load-bearing wall

The sections above on democratized creation and alignment debt converge on a single structural conclusion: in an AI-native organization, governance is not overhead. It is the mechanism by which the organization can afford to be fast.

This inverts the standard corporate framing. The standard framing treats governance as the brake — the process that slows things down in the name of safety and compliance. Under that framing, the AI-native dream is to minimize governance: to find the path of least institutional resistance from idea to deployed output. That framing produces companies that move fast and then revert, recall, or face consequences they did not budget for.

The organizations that have built durable AI-native practices describe a different relationship. Joel Hron, CTO of Thomson Reuters — a company deploying AI across legal, tax, and risk workflows where, as he puts it, the risks of being wrong are not particularly acceptable — describes the shift that gives this book its title: the north star has moved from assistants that are merely helpful to systems asked to produce output and make judgments and decisions on behalf of users. Making that move in high-stakes work is not primarily a model-quality question; it is a trust question. And the book's argument is that a colleague is trusted with consequential work not because they can never be wrong but because there is a system around them — accountability, escalation, review — that bounds the damage when they are. Agents earn colleague status the same way. Governance is not the obstacle to trust; it is what trust is built from.

The practical consequence is that governance is not a Phase 3 problem to be solved after the agents are running. It is a Day 1 design question. An agent running with unbounded credentials and no audit trail cannot be granted access to the production system — not because the risk team said no, but because there is no basis on which to say yes. Scoped credentials, bounded authority, and comprehensive audit logs do not constrain the agent's capability. They are what allows the agent to be deployed at all. Chapter 7 made this argument for the technical security model; here it applies at the organizational level.
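A minimal Python sketch of scoped credentials with an audit trail follows. The `Grant` and `Gate` names, the resource and action strings, and the in-memory log are all hypothetical; a production system would sit on a real identity provider and durable log storage. The sketch only illustrates the shape of the argument: every request is checked against an approved scope, and every decision is recorded.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Grant:
    """What one agent may do, and who approved it (illustrative fields)."""
    agent_id: str
    approved_by: str
    allowed: frozenset[tuple[str, str]]  # (resource, action) pairs


@dataclass
class Gate:
    """Checks each request against the grant and records the decision."""
    grant: Grant
    audit_log: list[dict] = field(default_factory=list)

    def authorize(self, resource: str, action: str) -> bool:
        permitted = (resource, action) in self.grant.allowed
        # Denials are logged too: the trail is the basis for saying yes later.
        self.audit_log.append(
            {
                "at": datetime.now(timezone.utc).isoformat(),
                "agent": self.grant.agent_id,
                "approved_by": self.grant.approved_by,
                "resource": resource,
                "action": action,
                "permitted": permitted,
            }
        )
        return permitted


gate = Gate(
    Grant(
        agent_id="support-agent-1",
        approved_by="platform-lead",
        allowed=frozenset({("docs", "write"), ("billing", "read")}),
    )
)
can_read = gate.authorize("billing", "read")
can_write = gate.authorize("billing", "write")
```

Widening the agent's authority then becomes a small, reviewable change to one `Grant` rather than a leap of faith, which is what makes the next grant defensible to whoever signs off.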

The organizations succeeding with enterprise AI at scale — across finance, legal, and regulated health contexts — consistently describe the same enabling sequence: define the governance architecture first, then expand the scope of what agents can do within it. The pilot that lives in a sandbox indefinitely is usually not waiting for a better model. It is waiting for someone to design the credential scope, the audit trail, and the escalation path that would allow it to move from demonstration to production. The governance design is the unlocking work.

The resulting principle is a reframe of the question every enterprise team should be asking. Instead of "what is the minimum governance we can get away with?", ask "what is the lightest governance that earns us the trust to go fast?" The answer is always less than the instinctive first draft — layers of approval designed for human-speed work — and always more than the proof-of-concept default of no governance at all. The right amount is whatever makes the next grant of access defensible to the person who has to sign off on it.

The company becomes a harness for its own agents

Pull the threads together and a single shape emerges. The organization that handles cheap creation, the review bottleneck, alignment debt, and governance is doing, at the scale of a company, exactly what Chapter 3 said a good codebase does for a coding agent. It is building a harness.

An AI-native organization has done for itself what a good codebase does for a coding agent: it has taken the judgment that used to live tacitly in senior people's heads and the hallway conversation and made it explicit, versioned, and available to both humans and agents. Shared standards instead of folklore. Written policies instead of "ask Sarah." Permission systems instead of trust by default. Review gates instead of hoping.

This is the synthesis the whole book has been building toward: an organization is the macro-scale version of the same object. A company is a harness for its own agents, human and machine alike, and AI-native advantage is the quality of that harness.

That is also why the advantage is so hard to copy. A competitor can buy the same models and the same seats tomorrow. What it cannot buy overnight is an operating model in which institutional judgment has been packaged into reusable constraints — broad paths to create, narrow paths to ship, review built as a system, alignment paid upfront, and standards legible to the agents doing the work. That is built, not purchased, and the building is the moat.

What this means for what endures

The AI-native organization is not the one with the most enthusiastic prompting culture, and it is not the one with the highest activity dashboards. It is the one that learned to convert cheap generation into trusted throughput — which turns out to require the least glamorous things in the building: clear standards, real review, honest measurement, and alignment that happens before the work fans out rather than after it collides.

Notice what that list does not contain. It does not contain a particular model, a particular vendor, or a particular protocol. Every concrete technology in this book will be replaced; some of it already has been between the talks that anchor these chapters and the page you are reading. What persists is the shape of the problem: cheap execution makes judgment scarce, scarce judgment has to be organized, and organizing it is an engineering discipline applied to an institution rather than a codebase.

The technical question and the organizational question, in the end, turn out to be the same question asked at two scales. How do you build a system in which delegated work compounds instead of fragmenting? For a single agent, the answer was a harness. For a company, the answer is the same word at a larger size. The final chapter asks what survives when the models, the tools, and the org charts have all turned over again — and the answer it reaches for is already visible here, in the parts of the harness that were never really about AI at all.

What to do with this

  • Audit your engineering dashboards for activity-vs-outcome confusion. If you are reporting PRs opened, commits, or lines generated, you are counting the abundant resource; replace or supplement those with outcome measures — rework rate, share of generated work that ships unreverted, and time spent in the review queue — because counting artifacts in a world where artifacts are cheap is counting the wrong thing.
  • Treat review and eval capacity as the company's throughput limit, not a QA afterthought. Build the review function as a system rather than asking humans to review harder: add layered validation so automated checks and evals clear routine cases, triage rules that decide what a human must see versus what can ship on green, and a single roll-up surface (à la Cursor's software-factory work) showing what each agent is doing and what the human actually needs to look at.
  • Move alignment upstream of the agent fan-out. Make the plan, not the pull request, the place two people's work first meets — shared planning, context-gathering, and visible work decomposition before the agents start — so two dozen agents work from one understood plan instead of colliding at merge. Alignment debt accrues invisibly while things feel fast and comes due all at once.
  • Widen who can start work, then harden the single path by which it ships. Take Lisa Orr's provocation seriously — let non-engineers (e.g., support) open pull requests — but route every change through the same tests, evals, review gates, and permission boundaries, because democratized creation and stronger governance have to rise together or the first becomes a liability.
  • Name an owner for the judgment layer. Following James Lowe, designate an AI product manager who decides which of the many cheap artifacts is worth shipping and shapes the constraints under which non-specialists create safely — this judgment does not allocate itself once creation is cheap.
  • Close the last human-only path rather than buying more seats. The holdout process — the review, handoff, or plan that still assumes human-speed work — is the constraint. Partial delegation makes individual people faster inside an unchanged workflow; redesigning the workflow requires finding and converting that holdout. Identify it and convert it, because more seats in an unchanged process does not change the process.
  • Diagnose before you prescribe. Use the five-level model as a diagnostic, not a ranking: an organization at L2 that believes it is at L3 will spend a year on throughput solutions when the thing actually holding it back is consistency. Find the actual bottleneck — access, measurement, standardization, redesign, or externalization — and work on that specifically.
  • Design governance architecture before expanding scope. Ask not "what is the minimum governance we can get away with?" but "what is the lightest governance that earns us the trust to go fast?" The pilot that lives in a sandbox indefinitely is usually not waiting for a better model — it is waiting for someone to define the credential scope, the audit trail, and the escalation path that would make production access defensible. Do that work first.
  • Write it down before you scale it. For small teams: the ambient alignment that makes your current AI usage feel coherent will not survive the next hiring round. Write the specs, decision logs, and conventions now — not because anyone needs them today, but because the next ten people will.

18 claims · 71 source anchors

Evidence — Source Anchors

The important transition is from suggestion to delegated execution

  • from helpfulness to productive
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • I think they need more
    #3 — Jacob Lauritzen, Legora · confidence: high
  • most primitives the magic happens when you combine these things together
    #138 — Sam Bhagwat, Mastra.ai · confidence: high

Harness quality is a major determinant of coding-agent quality

  • a good harness is really operationalized around giving the model text at the right time
    #16 — Ryan Lopopolo, OpenAI · confidence: high
  • there's so much work that has been put in uh over the last you know 20 to 30 years around the automated validation and verification of software that you build
    #57 — Eno Reyes, Factory AI · confidence: high
  • instead of micromanaging, what I'm doing is I'm scaffolding and providing context.
    #190 — Eric Hou, Augment Code · confidence: high
  • identifying problems with the code because if there's no problems then it's probably high quality code
    #179 — Josh Albrecht, Imbue · confidence: high

Agent-ready codebases are designed, not discovered

  • agents MD files an open standard
    #57 — Eno Reyes, Factory AI · confidence: high
  • context deficit as the biggest blocker.
    #190 — Eric Hou, Augment Code · confidence: high
  • a garbage codebase you're going to get
    #621 — Matt Pocock · confidence: high

Evals are a control system, not just a test suite

  • improvement without measurement is limited and imprecise.
    #125 — Ido Pesok, Vercel v0 · confidence: high
  • We still want to build reliable scalable applications and that is still hard
    #184 — Samuel Colvin, Pydantic · confidence: high
  • eval to us it's actually the same problem from a from a systems perspective.
    #628 — Phil Hetzel, Braintrust · confidence: high
  • small CLI tool that we call eval tool
    #689 — Lawrence Jones, incident.io · confidence: high
  • designed to allow agents to leverage our eval suite files.
    #689 — Lawrence Jones, incident.io · confidence: high
  • classic benchmark maxing.
    #746 — Ara Khan, Cline · confidence: high
  • There are right ways to use them. There are wrong ways to use them.
    #746 — Ara Khan, Cline · confidence: high

Human oversight works best as an architectural layer, not an afterthought

  • There needs to be human interaction for approvals or other reasons and of course they need to be able to be uh able to run in parallel for efficiency
    #167 — Preeti Somal, Temporal · confidence: high
  • dial these agency dials far up.
    #206 — Joel Hron, Thomson Reuters · confidence: high
  • maintaining a factory would require you to have an overview of the processes you want your coding agents to go through.
    #629 — Eric Zakariasson, Cursor · confidence: high

The harness is evolving from a local loop into a staged software factory

  • getting to a place where you can build your own like software factory
    #629 — Eric Zakariasson, Cursor · confidence: high
  • unified agent harness that will manage
    #632 — Vaibhav Srivastav & Katia Gil Guzman, OpenAI · confidence: high
  • parallel agents working together to fix
    #42 — Robert Brennan, OpenHands · confidence: high
  • The difference with missions is that we run features serially.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • Our longest mission ran for 16 days
    #653 — Luke Alvoeiro, Factory · confidence: high
  • We just kind of gave each role its own kind of context window.
    #691 — Ash Prabaker & Andrew Wilson, Anthropic · confidence: high
  • it's no longer about the model or the agent. It's about the process.
    #743 — Vincent Koc, OpenClaw · confidence: high

The context gap increasingly includes capability packaging and progressive disclosure

  • doesn't have to be loaded immediately to context.
    #683 — Pedro Rodrigues, Supabase · confidence: high
  • specifically with progressive disclosure.
    #654 — Nick Nisi & Zack Proser, WorkOS · confidence: high
  • 49% reduction of the initial load.
    #625 — Sam Morrow, GitHub · confidence: high
  • rich interactive components that render directly in the chat.
    #747 — Marlene Mhangami & Liam Hampton, GitHub · confidence: high

AI-native advantage depends on organizational coherence, not output volume alone

  • you'll be able to assemble agent teams that can complete tasks orders of magnitude harder than what you can complete with a single agent today.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • observing their workflows, their pain points, co-designing solutions with them
    #693 — Eoin Mulgrew, 10 Downing Street · confidence: high
  • maintaining a factory would require you to have an overview of the processes you want your coding agents to go through.
    #629 — Eric Zakariasson, Cursor · confidence: high

Coordination is the unsolved runtime primitive for multi-agent systems

  • the thing that's missing for me is coordination.
    #704 — Lou Bichard, Ona · confidence: high
  • through sort of state machines, you know, by building out workflows and effectively state machines
    #704 — Lou Bichard, Ona · confidence: high
  • They step on each other's changes. They duplicate work. They make inconsistent architectural decisions.
    #653 — Luke Alvoeiro, Factory · confidence: high
  • we have the two agents basically negotiate what done actually means.
    #691 — Ash Prabaker & Andrew Wilson, Anthropic · confidence: high

Context engineering is a primary engineering discipline, not a prompt trick

  • picking up the right documents and answering those questions is a really cool use case.
    #100 — Ofer Mendelevitch, Vectara · confidence: high
  • cool load generator that Kalen wrote that lets you configure agent swarms uh and agent subtasks with very specific SLOs
    #104 — Val Bercovici, WEKA · confidence: high
  • connect the dots with graph technology and solve problems like context engineering
    #105 — Stephen Chin, Neo4j · confidence: high
  • the right agent in the future is going to be this system that decides what type of search
    #157 — Will Bryk, Exa.ai · confidence: high

Enterprise usefulness scales with working-set quality, not corpus size

  • about 73% of LM customers implementing use cases say that factual accuracy is their top challenge right now.
    #100 — Ofer Mendelevitch, Vectara · confidence: high
  • how Harvey tackles retrieval, the types of problems there are and then the challenges that come up with that all with like retrieval quality, scaling, uh security,
    #154 — Calvin Qi (Harvey) & Chang She (Lance) · confidence: high
  • how to build enterprise aware agents. How to bring the brilliance of AI into the messy complex realities
    #193 — Chau Tran, Glean · confidence: high
  • you don't need a trillion at once, you need the right million.
    #756 — Kuba Rogut, Turbopuffer · confidence: high

Enterprise MCP adoption converges on gateways, blessed platforms, and a root of trust

  • we think that the goal for a secure this for any security team is to is to bless one platform.
    #624 — Karan Sampath, Anthropic · confidence: high
  • challenges we've faced building and scaling our remote server, how we've overcome them,
    #625 — Sam Morrow, GitHub · confidence: high
  • if we continue this pattern for hundreds or thousands of agents, we've got a pretty big security problem on our hand.
    #150 — Jared Hanson, Keycard · confidence: high

AI-native advantage is an operating-model redesign, not a procurement decision

  • there's a 10x difference between an org where 90% of the engineers are using AI versus an org where 100% of the engineers are using AI.
    #65 — Dan Shipper, Every · confidence: high
  • 80% of respondents say LLMs are working well at work,
    #137 — Barr Yaron, Amplify (2025 AI Engineering Report) · confidence: high
  • It's about evolving from AI features sprinkled into the product to rethinking how you plan, build, and deliver value all through an AI lens.
    #199 — From Hype to Habit (AI-first SaaS) · confidence: high
  • writing code has never been the bottleneck, right? We can in uh we can increase productivity a bit by helping with code completion, but our our biggest bottlenecks are elsewhere within the SDLC.
    #62 — Justin Reock, DX (acq. Atlassian) · confidence: high

Broader creation requires tighter review and governance — they rise together or the first becomes a liability

  • at Zapier we are empowering our support team to ship code.
    #69 — Lisa Orr, Zapier · confidence: high
  • I'm going to make the case for the AI product manager. I'm going to argue that AI expertise is really important for this role.
    #162 — James Lowe, i.AI · confidence: high
  • all these skills that you're prioritizing don't necessarily need to be one person. They can be multiple people.
    #188 — Denys Linkov, Wisedocs · confidence: high
  • I'm going to talk to you today about how I believe AI is breaking how we hire technically.
    #207 — Beth Glenfield, DevDay · confidence: high
  • the challenge becomes who do I say no to?
    #743 — Vincent Koc, OpenClaw · confidence: high

Activity-based metrics misread motion as progress in AI-augmented work

  • these are not productivity metrics. They're useful, but you cannot just kind of use them like maximize them to maximize developer productivity.
    #79 — Yegor Denisov-Blanch, Stanford (120k devs study) · confidence: high
  • I do think that AI increases developer productivity, but there's also cases in which it decreases developer productivity.
    #195 — Yegor Denisov-Blanch, Stanford (100k devs study) · confidence: high
  • just plain old PR throughput. How many pull requests does the average engineer merge per week?
    #101 — Nick Arcolano, Jellyfish (20M PRs) · confidence: high
  • I'm going to talk about how we pay engineers. And we pay engineers like salespeople.
    #63 — Arman Hezarkhani, Tenex · confidence: high

Review capacity is the throughput limit of an AI-native organization

  • this talk uh is called uh one developer, two dozen agents, zero alignment. Uh this is the case for why we need collaborative AI engineering.
    #623 — Maggie Appleton, GitHub · confidence: high
  • you should have multiple different stages where you you plan it, you produce it, you review it and you essentially follow the whole uh SLC
    #629 — Eric Zakariasson, Cursor · confidence: high
  • every software engineer becomes a code reviewer as basically their primary job.
    #54 — Max Kanat-Alexander, Capital One · confidence: high

Alignment debt is the AI-native equivalent of technical debt

  • None of our current tools give teams a shared space to discuss plans, gather the right context, and work with agents as a collective.
    #623 — Maggie Appleton, GitHub · confidence: high
  • if we believe that all of our products are for like for all time going to be probabilistic, then like we probably have to figure out how this world works.
    #160 — Ben Stein, Teammates · confidence: high
  • you kind of like frontload uh the context to the agents either through like a plan or a long spec and then you send them off
    #629 — Eric Zakariasson, Cursor · confidence: high

Cheap generation raises the value of taste and judgment rather than lowering it

  • software fundamentals matter now more than they actually ever have.
    #1 — Matt Pocock, AI Hero · confidence: high
  • capable of doing everything um immediately
    #6 — Tuomas Artman & Gergely Orosz · confidence: high
  • intentionally designed to put friction
    #14 — Armin Ronacher & Cristina Poncela Cubeiro · confidence: high