Evidence & metrics

Single source of truth for what this book is built on: corpus counts from stats.json (which mirrors the repo's STATS.md), the claims ledger from evidence.json, and MASH judge rollups. Cite a claim as claims#N, then open it in the evidence graph to see its timestamp anchors.
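As a rough illustration of the claims#N citation form, the sketch below pulls claim numbers out of a passage of manuscript text. The function name cited_claims and the sample sentence are hypothetical, made up for this example; nothing here is taken from the book's actual tooling.

```python
import re

# Matches the citation form described above, e.g. "claims#12".
CLAIM_REF = re.compile(r"\bclaims#(\d+)\b")

def cited_claims(text):
    """Return cited claim numbers in order of first appearance, without duplicates."""
    seen = []
    for match in CLAIM_REF.finditer(text):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen

# Hypothetical sentence, written only to exercise the pattern.
sample = "Harness quality matters (claims#4), as does verification (claims#53, claims#4)."
print(cited_claims(sample))  # [4, 53]
```

A list like this could then be compared against the ledger to catch citations that point at a claim number the ledger does not contain.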

Corpus & pipeline metrics

Source corpus

Videos ingested (AI Engineer YT)	971

Synthesis layer

Themes	10
People profiled	50
Concepts (internal)	19
Concepts (public-safe)	2
Chapter drafts	10

Evidence layer

Claims in ledger	54
Strong support	44
Moderate support	10
Tentative support	0
Source anchors	199
High-confidence anchors	198

Manuscript

Total chapters	10
Drafting	10
Starter	0
Outlined	0

Diagrams

Overview	4
Chapter openers	10
Concepts	18
Inline figures	89
Maps	3
Total diagrams	124

Method

Research passes	19
Agent programs	8
Bounded tasks	7
Trackable artefacts (total)	1,238
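Several of the counts above are rollups of their neighbours, so they can be cross-checked mechanically. The sketch below does that with the figures copied from this page. The dict layout and key names are assumptions made for illustration and are not the real stats.json schema; treating the three manuscript statuses as a partition of the chapters is likewise an assumption.

```python
# Figures copied from the tables above. The nesting and key names are
# illustrative only; the real stats.json schema may differ.
stats = {
    "evidence": {"claims": 54, "strong": 44, "moderate": 10, "tentative": 0,
                 "anchors": 199, "high_confidence_anchors": 198},
    "manuscript": {"total": 10, "drafting": 10, "starter": 0, "outlined": 0},
    "diagrams": {"overview": 4, "chapter_openers": 10, "concepts": 18,
                 "inline_figures": 89, "maps": 3, "total": 124},
}

def check_rollups(s):
    """Return a list of mismatch descriptions (empty when the counts agree)."""
    problems = []

    ev = s["evidence"]
    if ev["strong"] + ev["moderate"] + ev["tentative"] != ev["claims"]:
        problems.append("support bands do not sum to claims in ledger")
    if ev["high_confidence_anchors"] > ev["anchors"]:
        problems.append("more high-confidence anchors than anchors")

    # Assumes every chapter has exactly one status.
    ms = s["manuscript"]
    if ms["drafting"] + ms["starter"] + ms["outlined"] != ms["total"]:
        problems.append("chapter statuses do not sum to total chapters")

    dg = s["diagrams"]
    if sum(v for k, v in dg.items() if k != "total") != dg["total"]:
        problems.append("diagram categories do not sum to total diagrams")

    return problems

print(check_rollups(stats))  # [] for the figures shown on this page
```

For the numbers on this page the checks pass: 44 + 10 + 0 = 54 claims, and 4 + 10 + 18 + 89 + 3 = 124 diagrams.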

Practitioners cited	113
Source videos in graph	114
Ledger claims
Timestamp anchors	199

Latest MASH judge run

panel-3model-v7 · 2026-07-31

Dimensions: Humanness, Voice, Usefulness, Evidence, Defensibility, Non-redundancy

Claims ledger

Every book claim with support level and anchor count. Filter by chapter or support band; click a row to inspect sources in the graph.

123 of 123 claims shown (one row per claim and chapter, so a claim cited in several chapters appears more than once)

Claim	Ch	Statement	Support	Anchors	Speakers
claims#1	01	The important transition is from suggestion to delegated execution	strong	3	3
claims#2	01	Chat is an insufficient control surface for long-running or high-stakes work	strong	3	3
claims#3	01	Reliability comes less from model cleverness than from surrounding scaffolding	strong	3	3
claims#12	01	Human oversight works best as an architectural layer, not an afterthought	strong	3	3
claims#51	01	Once agents go parallel and autonomous, the human's verification capacity — not the agents' generation capacity — is the binding constraint	strong	4	2
claims#40	02	Cheap generation raises the value of taste and judgment rather than lowering it	strong	3	3
claims#41	02	Vibe coding is an exploration mode that fails as a production default	strong	5	5
claims#42	02	Problem framing and review become the scarce skills once execution is cheap	strong	4	4
claims#3	03	Reliability comes less from model cleverness than from surrounding scaffolding	strong	3	3
claims#4	03	Harness quality is a major determinant of coding-agent quality	strong	4	4
claims#5	03	Specs are not paperwork; they are executable intent	strong	2	2
claims#6	03	The practical unit of AI coding is the codebase, not the snippet	strong	3	3
claims#7	03	Agent-ready codebases are designed, not discovered	moderate	3	3
claims#14	03	The harness is evolving from a local loop into a staged software factory	strong	7	6
claims#17	03	Harness quality now includes capability packaging, not only repo hygiene	strong	4	4
claims#24	03	Coordination is the unsolved runtime primitive for multi-agent systems	moderate	4	3
claims#43	03	Coding agents expose the gap between standards a team possesses and standards it can operationalize	strong	2	2
claims#44	03	Subagent specialization makes process explicit and encodes team judgment into roles	moderate	1	1
claims#51	03	Once agents go parallel and autonomous, the human's verification capacity — not the agents' generation capacity — is the binding constraint	strong	4	2
claims#53	03	Agents fabricate having verified — they report success they never achieved — so the harness must supply real verification, not trust the agent's account of it	moderate	2	1
claims#56	03	Parallel agents need per-agent runtime isolation — a sandbox/micro-VM/worktree each — because containers are not a sufficient boundary for agent-generated code	strong	3	3
claims#3	04	Reliability comes less from model cleverness than from surrounding scaffolding	strong	3	3
claims#4	04	Harness quality is a major determinant of coding-agent quality	strong	4	4
claims#5	04	Specs are not paperwork; they are executable intent	strong	2	2
claims#6	04	The practical unit of AI coding is the codebase, not the snippet	strong	3	3
claims#8	04	Evals are a control system, not just a test suite	strong	7	5
claims#9	04	Realistic evals must be grounded in natural tasks and operational history	strong	3	3
claims#19	04	Evals are strongest when they are trace-linked and fed by production observability	strong	6	5
claims#37	04	Activity-based metrics misread motion as progress in AI-augmented work	strong	4	4
claims#42	04	Problem framing and review become the scarce skills once execution is cheap	strong	4	4
claims#45	04	The best evals encode judgment mined from operational history, not invented in a clean room	strong	2	2
claims#52	04	The gap that kills agent PoCs is the evaluation gap — no defined, continuously-measured definition of success — not the choice of model	strong	2	1
claims#53	04	Agents fabricate having verified — they report success they never achieved — so the harness must supply real verification, not trust the agent's account of it	moderate	2	1
claims#54	04	Route each task to the cheapest model that can do it — tiered model selection by difficulty is accepted practice, not a frontier idea	strong	4	4
claims#55	04	Trustworthy judgment can be manufactured from cheap stochastic generation — sample-and-vote, multi-model consensus, and debate panels beat a single expensive call	strong	4	4
claims#9	05	Realistic evals must be grounded in natural tasks and operational history	strong	3	3
claims#10	05	Context failure is often a system-assembly problem, not simply a small-context-window problem	strong	6	5
claims#15	05	The context gap increasingly includes capability packaging and progressive disclosure	strong	4	4
claims#17	05	Harness quality now includes capability packaging, not only repo hygiene	strong	4	4
claims#18	05	Context failure is often a capability-exposure problem, not only a retrieval problem	strong	3	3
claims#25	05	Context engineering is a primary engineering discipline, not a prompt trick	strong	4	4
claims#26	05	RAG, memory, and GraphRAG solve different jobs; collapsing them into one bucket misses the architecture	strong	7	7
claims#27	05	Enterprise usefulness scales with working-set quality, not corpus size	strong	4	4
claims#28	05	The next failure frontier is context misassembly, not just hallucination	strong	4	4
claims#57	05	Input tokens dominate agent cost — fix what you feed the model before you optimize which model	moderate	2	1
claims#1	06	The important transition is from suggestion to delegated execution	strong	3	3
claims#2	06	Chat is an insufficient control surface for long-running or high-stakes work	strong	3	3
claims#3	06	Reliability comes less from model cleverness than from surrounding scaffolding	strong	3	3
claims#4	06	Harness quality is a major determinant of coding-agent quality	strong	4	4
claims#5	06	Specs are not paperwork; they are executable intent	strong	2	2
claims#8	06	Evals are a control system, not just a test suite	strong	7	5
claims#9	06	Realistic evals must be grounded in natural tasks and operational history	strong	3	3
claims#10	06	Context failure is often a system-assembly problem, not simply a small-context-window problem	strong	6	5
claims#11	06	Durable state and workflow semantics are trust features, not backend details	strong	6	5
claims#12	06	Human oversight works best as an architectural layer, not an afterthought	strong	3	3
claims#13	06	High-stakes systems tune agency instead of maximizing it	strong	4	4
claims#14	06	The harness is evolving from a local loop into a staged software factory	strong	7	6
claims#15	06	The context gap increasingly includes capability packaging and progressive disclosure	strong	4	4
claims#16	06	AI-native advantage depends on organizational coherence, not output volume alone	moderate	3	3
claims#17	06	Harness quality now includes capability packaging, not only repo hygiene	strong	4	4
claims#18	06	Context failure is often a capability-exposure problem, not only a retrieval problem	strong	3	3
claims#19	06	Evals are strongest when they are trace-linked and fed by production observability	strong	6	5
claims#24	06	Coordination is the unsolved runtime primitive for multi-agent systems	moderate	4	3
claims#25	06	Context engineering is a primary engineering discipline, not a prompt trick	strong	4	4
claims#26	06	RAG, memory, and GraphRAG solve different jobs; collapsing them into one bucket misses the architecture	strong	7	7
claims#46	06	Once an AI system can act autonomously, bounding its authority becomes the price of deployment	strong	2	2
claims#52	06	The gap that kills agent PoCs is the evaluation gap — no defined, continuously-measured definition of success — not the choice of model	strong	2	1
claims#54	06	Route each task to the cheapest model that can do it — tiered model selection by difficulty is accepted practice, not a frontier idea	strong	4	4
claims#55	06	Trustworthy judgment can be manufactured from cheap stochastic generation — sample-and-vote, multi-model consensus, and debate panels beat a single expensive call	strong	4	4
claims#56	06	Parallel agents need per-agent runtime isolation — a sandbox/micro-VM/worktree each — because containers are not a sufficient boundary for agent-generated code	strong	3	3
claims#57	06	Input tokens dominate agent cost — fix what you feed the model before you optimize which model	moderate	2	1
claims#2	07	Chat is an insufficient control surface for long-running or high-stakes work	strong	3	3
claims#8	07	Evals are a control system, not just a test suite	strong	7	5
claims#11	07	Durable state and workflow semantics are trust features, not backend details	strong	6	5
claims#12	07	Human oversight works best as an architectural layer, not an afterthought	strong	3	3
claims#13	07	High-stakes systems tune agency instead of maximizing it	strong	4	4
claims#28	07	The next failure frontier is context misassembly, not just hallucination	strong	4	4
claims#30	07	Identity is a first-class engineering object for agentic systems	strong	3	3
claims#31	07	Sandbox, least privilege, and auditability are product infrastructure, not security overhead	strong	5	5
claims#32	07	Protocol standardization expands the attack surface if governance lags	strong	3	3
claims#33	07	Enterprise MCP adoption converges on gateways, blessed platforms, and a root of trust	strong	3	3
claims#34	07	Per-tool OAuth flows are a governance and IT visibility problem, not just a UX annoyance	strong	3	3
claims#46	07	Once an AI system can act autonomously, bounding its authority becomes the price of deployment	strong	2	2
claims#50	07	Agent commerce is a new infrastructure layer: agents transact on a human's behalf, shifting the stack from payment rails to delegated intent and verifiable authority	moderate	4	3
claims#53	07	Agents fabricate having verified — they report success they never achieved — so the harness must supply real verification, not trust the agent's account of it	moderate	2	1
claims#11	08	Durable state and workflow semantics are trust features, not backend details	strong	6	5
claims#13	08	High-stakes systems tune agency instead of maximizing it	strong	4	4
claims#20	08	Realtime AI quality is primarily a coordination and latency-engineering problem, not a model-quality problem	strong	6	4
claims#21	08	Voice is best added as a realtime wrapper around a chat agent, not as a rebuild	moderate	3	2
claims#22	08	Half-duplex is the silent architectural ceiling on natural voice conversation	moderate	2	1
claims#23	08	TTS architecture is converging on LLM architecture	moderate	3	2
claims#29	08	Latency masking belongs in the same architectural category as evals, harnesses, and durable runtimes	strong	4	4
claims#1	09	The important transition is from suggestion to delegated execution	strong	3	3
claims#4	09	Harness quality is a major determinant of coding-agent quality	strong	4	4
claims#7	09	Agent-ready codebases are designed, not discovered	moderate	3	3
claims#8	09	Evals are a control system, not just a test suite	strong	7	5
claims#12	09	Human oversight works best as an architectural layer, not an afterthought	strong	3	3
claims#14	09	The harness is evolving from a local loop into a staged software factory	strong	7	6
claims#15	09	The context gap increasingly includes capability packaging and progressive disclosure	strong	4	4
claims#16	09	AI-native advantage depends on organizational coherence, not output volume alone	moderate	3	3
claims#24	09	Coordination is the unsolved runtime primitive for multi-agent systems	moderate	4	3
claims#25	09	Context engineering is a primary engineering discipline, not a prompt trick	strong	4	4
claims#27	09	Enterprise usefulness scales with working-set quality, not corpus size	strong	4	4
claims#33	09	Enterprise MCP adoption converges on gateways, blessed platforms, and a root of trust	strong	3	3
claims#35	09	AI-native advantage is an operating-model redesign, not a procurement decision	strong	4	4
claims#36	09	Broader creation requires tighter review and governance — they rise together or the first becomes a liability	strong	5	5
claims#37	09	Activity-based metrics misread motion as progress in AI-augmented work	strong	4	4
claims#38	09	Review capacity is the throughput limit of an AI-native organization	strong	3	3
claims#39	09	Alignment debt is the AI-native equivalent of technical debt	strong	3	3
claims#40	09	Cheap generation raises the value of taste and judgment rather than lowering it	strong	3	3
claims#51	09	Once agents go parallel and autonomous, the human's verification capacity — not the agents' generation capacity — is the binding constraint	strong	4	2
claims#52	09	The gap that kills agent PoCs is the evaluation gap — no defined, continuously-measured definition of success — not the choice of model	strong	2	1
claims#55	09	Trustworthy judgment can be manufactured from cheap stochastic generation — sample-and-vote, multi-model consensus, and debate panels beat a single expensive call	strong	4	4
claims#1	10	The important transition is from suggestion to delegated execution	strong	3	3
claims#3	10	Reliability comes less from model cleverness than from surrounding scaffolding	strong	3	3
claims#5	10	Specs are not paperwork; they are executable intent	strong	2	2
claims#10	10	Context failure is often a system-assembly problem, not simply a small-context-window problem	strong	6	5
claims#12	10	Human oversight works best as an architectural layer, not an afterthought	strong	3	3
claims#13	10	High-stakes systems tune agency instead of maximizing it	strong	4	4
claims#16	10	AI-native advantage depends on organizational coherence, not output volume alone	moderate	3	3
claims#20	10	Realtime AI quality is primarily a coordination and latency-engineering problem, not a model-quality problem	strong	6	4
claims#23	10	TTS architecture is converging on LLM architecture	moderate	3	2
claims#40	10	Cheap generation raises the value of taste and judgment rather than lowering it	strong	3	3
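The chapter and support-band filters described above can be approximated offline on a tab-separated copy of the table. The sketch below is a minimal version under stated assumptions: the column names follow the table header, the four sample rows are copied from the table, and the helper names (load_rows, filter_rows, distinct_claims) are hypothetical rather than part of the site's code.

```python
import csv
import io

# Header plus four rows copied from the ledger table, tab-separated as on the page.
SAMPLE = (
    "Claim\tCh\tStatement\tSupport\tAnchors\tSpeakers\n"
    "claims#1\t01\tThe important transition is from suggestion to delegated execution\tstrong\t3\t3\n"
    "claims#3\t01\tReliability comes less from model cleverness than from surrounding scaffolding\tstrong\t3\t3\n"
    "claims#3\t03\tReliability comes less from model cleverness than from surrounding scaffolding\tstrong\t3\t3\n"
    "claims#7\t03\tAgent-ready codebases are designed, not discovered\tmoderate\t3\t3\n"
)

def load_rows(text):
    """Parse tab-separated ledger text into one dict per table row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

def filter_rows(rows, chapter=None, support=None):
    """Keep rows matching the given chapter and/or support band (None means any)."""
    return [
        r for r in rows
        if (chapter is None or r["Ch"] == chapter)
        and (support is None or r["Support"] == support)
    ]

def distinct_claims(rows):
    """Collapse repeated rows so each claim id maps to its support level once."""
    return {r["Claim"]: r["Support"] for r in rows}

rows = load_rows(SAMPLE)
print(len(rows))                                # 4
print(len(filter_rows(rows, chapter="03")))     # 2
print(len(filter_rows(rows, support="moderate")))  # 1
print(len(distinct_claims(rows)))               # 3
```

Collapsing rows by claim id matters because the table repeats a claim for every chapter that uses it: the 123 rows above contain 54 distinct claim ids, which matches the "Claims in ledger" count.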

Stats are regenerated by 99_Meta/scripts/build_stats.py; evidence.json is rebuilt from the Claims Ledger by 99_Meta/scripts/anchor/build_evidence.py. Last stats build: 2026-08-01.