Every chapter is scored by the MASH judges across six dimensions: three of craft (humanness, voice, usefulness) and three of epistemics (evidence density, claim defensibility, non-redundancy). Higher is better; colour marks the band.

| Chapter | Humanness | Voice | Usefulness | Evidence | Defensibility | Non-redundancy |
|---|---|---|---|---|---|---|
| Book | 76 | 83 | 42 | 90 | 74 | 72 |
| 01 The Shift: From Assistant to Delegate | 72 | 91 | 33 | 90 | 75 | 100 |
| 02 Taste Still Matters When Code Gets Cheap | 74 | 82 | 39 | 90 | 79 | 62 |
| 03 Harnesses, Specs, and Codebases Agents Can Actually Use | 76 | 87 | 28 | 90 | 66 | 72 |
| 04 Evals Are the Control System | 74 | 84 | 48 | 90 | 72 | 78 |
| 05 Context Is Infrastructure | 77 | 74 | 41 | 90 | 79 | 72 |
| 06 Runtimes, State, and the Human Control Plane | 79 | 88 | 54 | 90 | 73 | 72 |
| 07 Security, Identity, and High-Stakes Trust | 75 | 87 | 55 | 89 | 76 | 82 |
| 08 Realtime, Voice, and the Cost of Being Interruptible | 72 | 82 | 39 | 90 | 68 | 78 |
| 09 The AI-Native Organization | 80 | 72 | 53 | 90 | 76 | 42 |
| 10 What Endures | 76 | 78 | 30 | 90 | 78 | 62 |

The usefulness judge scores every paragraph for operational density, so it floors the connective tissue every narrative chapter is made of: transitions, scene-setters, recaps, and chapter 10’s reflective register. The substantive core re-averages over the operational paragraphs only, setting aside 68 of 321 paragraphs (21%) that are either markdown headings mis-scored as prose or short bridge sentences that the judge’s own rationale identifies as transitions. The gap between the raw score and the substantive core is the genre ceiling. The floor that remains is real: prose that could carry a decision, a threshold, or a test and does not yet.
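The re-averaging described above can be sketched in a few lines. This is a minimal illustration, not the judges' actual pipeline: the `Paragraph` record, its field names, the `usefulness_summary` helper, and the sample scores are all hypothetical, chosen only to show how setting aside headings and transitions moves the mean.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Paragraph:
    """Hypothetical record for one scored paragraph (field names are assumptions)."""
    text: str
    usefulness: float            # per-paragraph usefulness score, 0-100
    is_heading: bool = False     # markdown heading mis-scored as prose
    is_transition: bool = False  # judge's own rationale calls it a transition


def usefulness_summary(paragraphs):
    """Return (raw mean, substantive-core mean, number of paragraphs set aside)."""
    # The substantive core keeps only the operational paragraphs.
    core = [p for p in paragraphs if not (p.is_heading or p.is_transition)]
    raw_mean = mean(p.usefulness for p in paragraphs)
    # Guard against a chapter with no operational paragraphs at all.
    core_mean = mean(p.usefulness for p in core) if core else float("nan")
    return raw_mean, core_mean, len(paragraphs) - len(core)


# Illustrative values only; these are not scores from the report.
sample = [
    Paragraph("## Section title", 0.0, is_heading=True),
    Paragraph("With that in place, we turn to evals.", 10.0, is_transition=True),
    Paragraph("Gate merges on a passing eval suite.", 60.0),
    Paragraph("Set a rollback threshold before launch.", 50.0),
]

raw, core, set_aside = usefulness_summary(sample)
print(raw, core, set_aside)  # prints: 30.0 55.0 2
```

In this toy sample the raw mean is 30.0 and the substantive core is 55.0, so the 25-point gap comes entirely from the two paragraphs that were set aside. The same share calculation applied to the report's counts gives 68 / 321 ≈ 21%.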
The paragraph makes a specific and substantive claim: that agents expose a gap between standards a team *possesses* and standards a team can *operationalize*. This is a distinct, meaningful assertion about what working with coding agents reveals about team capability gaps. None of the provided ledger entries cover this claim. The closest entries (claims#3, #4, #5, #7) touch on scaffolding, harness quality, specs as executable intent, and designed codebases, but none of them address the notion of a possessed-vs-operationalized standards gap that agents surface. There is no ledger backing for this claim at any strength level, making it unsupported and triggering a fail.
The paragraph makes two distinct claims: (1) that parallelism can make things faster, and (2) that specialization makes process explicit. Neither of these claims has a matching ledger entry. The closest ledger entries touch on multi-agent coordination (claims#24), harness evolution (claims#14), and capability packaging (claims#17), but none of them support or even imply the specific assertion that "specialization makes process explicit." This is a substantive, non-trivial philosophical claim about the nature of specialization in AI coding systems, and it has no ledger backing whatsoever. The first claim (parallelism = speed) is also entirely absent from the ledger. Both claims are unsupported, triggering a fail.
The paragraph makes two substantive claims that require ledger backing:

1. **The move toward specialized agent roles** (research agents, review agents, refactor agents, debugging agents, subagent frameworks): While this is loosely adjacent to claims#14 (the harness evolving into a staged software factory) and claims#24 (coordination as an unsolved runtime primitive for multi-agent systems), the paragraph presents this as an established, interesting development in "modern coding systems" with no hedging. Neither ledger entry actually supports the specific claim that specialized roles are a significant or notable trend; claims#14 and #24 gesture at multi-agent complexity but do not assert the enumerated taxonomy of agent types or their prevalence.
2. **The direct quotation attributed to "OpenAI's Codex product documentation"**: This is a specific, verifiable factual claim, a verbatim quote from a named external source, and it has no corresponding ledger entry whatsoever. There is no claims entry supporting the existence, accuracy, or content of this quotation. This is a fabricated or at minimum unsupported citation, which is exactly the kind of claim the `fail` label is designed to catch. False positives are tolerable here; the absence of any ledger backing for a specific attributed quote is a ship-blocker.

The paragraph's central rhetorical move, grounding the discussion of multi-agent specialization in a specific product quote, rests entirely on an unsupported claim.
The paragraph makes two substantive claims: (1) that evals encode "judgment rather than facts," and (2) that this judgment is "mined, not invented" — with the superlative framing that this is "the strongest answer in the corpus." Neither claim maps cleanly to any ledger entry. Claims#9 and claims#19 come closest: they assert that realistic evals must be grounded in natural tasks and operational history, and that evals are strongest when trace-linked and fed by production observability. However, neither ledger entry frames evals as encoding "judgment rather than facts" — that is a distinct epistemological claim about the nature of evals, not merely about their sourcing methodology. The "mined, not invented" framing is a rhetorical elaboration that could be loosely inferred from claims#9/19 but is not directly supported; the ledger entries speak to grounding and observability, not to a judgment-vs-facts distinction. The phrase "strongest answer in the corpus" is also an assertive meta-claim with no ledger backing. Because the core framing ("judgment rather than facts") and the rhetorical conclusion ("mined, not invented") lack direct ledger support, this paragraph fails the unsupported-claim threshold.
This paragraph is a structural summary/transition passage that describes what each chapter covers. While some claims touch on ledger-backed concepts (durable state and human oversight align loosely with claims#11 and claims#12), the paragraph introduces a substantial, substantive claim about Chapter 7's content — that "identity, permissions, sandboxes, audit" are the "price" of agentic authority — which has no corresponding ledger entry. There is no ledger entry covering security, identity, permissions, sandboxes, or audit as a theme. This is not a minor rhetorical flourish; it is the paragraph's climactic, forward-looking claim and its central concluding argument. Additionally, the framing of Chapter 5 as engineering "what [the agent] sees" is a context-engineering claim that is broadly consistent with claims#25, but the framing of Chapter 3 as preparing "the workplace," Chapter 4 as building "the instrument," and Chapter 6 as ensuring work "persists and stays under human control" are asserted as settled chapter characterizations without direct ledger support for those exact framings. Most critically, the entire security/authorization claim in the second half of the paragraph is wholly unsupported by any ledger entry, triggering a fail.
The paragraph makes a pointed, aphoristic claim: that "a helpful model can get away with being vague about power, but an acting system cannot." This is a substantive assertion about the relationship between model behavior, agency, and the need for explicit power/permission definition in agentic systems. While this claim is thematically adjacent to several ledger entries (e.g., claims#13 on tuning agency, claims#31 on least privilege, claims#30 on identity as a first-class object, claims#2 on chat being insufficient), none of the ledger entries directly support or back the specific claim being made. The statement is a standalone philosophical/architectural assertion — that vagueness about power is tolerable for assistive models but not for acting systems — and no ledger entry addresses this distinction or the permissibility of "vagueness about power" in either context. There is no ledger backing for this claim, making it unsupported.
The paragraph asserts that once latency and orchestration engineering is solved, "the next ceiling is conversational, not technical." This is a distinct architectural and product claim, namely that a conversational ceiling supersedes the technical one, for which there is no matching ledger entry. The closest entries (claims#20, #22, #29) address latency as a coordination/engineering problem and half-duplex as an architectural ceiling, but none of them assert or imply that a "conversational ceiling" exists beyond or above the technical one. In fact, claims#22 frames half-duplex as the silent architectural ceiling on natural voice conversation, which is itself still a technical framing. The prose introduces a new, unsupported claim that goes meaningfully beyond any declared ledger evidence, triggering a fail.
The paragraph makes several specific claims that have no matching ledger entries:

1. The attribution of the quote "Your support team should ship code" to "Lisa Orr at Zapier": there is no ledger entry that supports this specific claim, the name, the company, or the framing.
2. The characterization that this is "one of the most radical organizational consequences in the corpus": there is no ledger entry supporting this superlative framing.
3. The core argument that democratizing code execution (non-engineers shipping code) follows from cheap execution: there is no ledger entry covering this democratization-of-coding claim.

While claims#35 (AI-native advantage is an operating-model redesign) and claims#1 (transition to delegated execution) are loosely thematically related, neither covers the specific claim that non-engineers (e.g., support teams) should or will ship code directly. The paragraph's central thesis, that collapsing implementation costs dissolves the engineering guild model, is entirely without ledger backing. The named attribution to "Lisa Orr at Zapier" is a fabricated or at least unverified specific claim with no ledger support whatsoever, which alone triggers a fail.