Harness Engineering
From Using AI to Controlling AI
Five interpretations from OpenAI, Anthropic, LangChain, Martin Fowler, and academia — merged into one system for engineers running AI agents in production
30+ technical books across 4 languages · Sold on Kindle in 6 countries · From a year of real production use
📖 Read for free
Read three full chapters right here before you buy. Liked it? Continue on Kindle.
01 Preface — Why 'Harness,' and Why Now

A Tuesday at 3 a.m.
3 a.m. on a Tuesday. The on-call engineer at one team gets jolted awake by a PagerDuty alert.
API costs have spiked. They check the dashboard: over $400 burned in the past hour. Digging in, they find that an AI agent deployed the day before has been hammering an unstable API with retries. Every error sends it back into the “let me try again” loop, and it ran like that until morning.
The agent wasn’t the problem. The model was fine. The prompt was carefully written. What was missing was a harness. They told the agent “run,” but never gave it brakes or a steering wheel.
This story isn’t unusual. There’s a phrase that gets passed around the field:
“The model is commodity. The harness is moat.”
When an agent that worked perfectly in a demo breaks in production, it’s almost always a harness problem.
In February 2026, OpenAI published a blog post: “Harness engineering: leveraging Codex in an agent-first world.”
Here’s what it said: for five months, an engineering team didn’t write a single line of code by hand. They built a production application of over a million lines using Codex agents alone. Build time: one-tenth of writing it manually.
“Humans steer. Agents execute.”
Engineers didn’t get their jobs taken. The definition of the job changed.
That post lit the fuse. Then came the “$47,000 retry storm” report from a weekend in February 2026. A data-enrichment agent misinterpreted an API error code as “retry with different parameters” and made 2.3 million API calls. Monday morning, engineers came back to a $47,000 bill. Nice that the agent worked over the weekend, but not great when the deliverable is zero and the invoice still arrives. A few days later Anthropic published two harness-design guides. LangChain defined “Agent = Model + Harness.” Martin Fowler wrote a commentary. An academic paper went up on arXiv.
2024 was the year of Prompt Engineering. The era of polishing “what to ask AI.”
2025 was the year of Context Engineering. Andrej Karpathy had already quipped in 2023 that “The hottest new programming language is English,” and by 2025 the work had shifted to designing “what to show the AI.”
In 2026, the scope widens to Harness Engineering. “How do you design the entire environment the agent operates in?”
But the term gets interpreted slightly differently depending on who’s writing. OpenAI and Anthropic emphasize different things. LangChain and Martin Fowler approach it from different angles. The academic papers come at it from yet another direction.
This book gives a structured overview of Harness Engineering.
- The relationship between the three engineering practices (Prompt / Context / Harness)
- How the major players (OpenAI / Anthropic / LangChain / Martin Fowler / academics) interpret it differently
- The anatomy of the six building blocks
- How it sits next to related ideas (Vibe Coding / Spec Coding / Agent Frameworks)
- Practical case studies from the Japanese-speaking community
- Where it’s all going
It’s both a book that organizes the concepts and a hands-on guide you can use tomorrow. My goal is simple: when someone asks “okay, but what is a harness?”, you can hand them this book as a clear answer.
Who this book is for
- Engineers who have started using AI agents (Claude Code, GitHub Copilot, Cursor, etc.)
- People who have written an AGENTS.md or CLAUDE.md but aren’t sure if they got it right
- People who know Prompt Engineering but are hearing “Harness Engineering” for the first time
- Managers and tech leads who want to bring AI agents into their team
The only prerequisite is the basics of Prompt Engineering. Having heard of Few-shot and Chain-of-Thought is enough.
How to read this book
You can read it cover to cover, or jump to the chapters you find interesting. That said, three chapters are worth reading no matter what:
- Chapter 02: understand how the three engineering practices relate (the map of the territory)
- Chapter 09: learn the six building blocks (the skeleton of practice)
- Chapter 12: learn how to write AGENTS.md (something you can use tomorrow)
02 The Three Engineering Evolutions — Prompt → Context → Harness
Why 40% fail
According to a Company of Agents survey, 40% of AI agent projects in 2026 fail.
What’s behind the failures? Wrong model? Bad prompts?
Neither. “The difference between success and failure isn’t the model.” That’s the consensus from the field.
A survey at Y Combinator DevTool Day (March 2026) interviewed CTOs and CPOs and found a common factor across failed projects: no harness. They never designed the environment the agent operates in.
75% of YC’s enterprise companies already have coding agents deployed. Yet many of them hit the same wall: “works in the demo, collapses in production.”
In March 2026, Linear declared “issue tracking is dead.” The reasoning: feed issue context straight to a coding agent and humans no longer need to manage tickets. Enterprise workflows are getting redesigned with agents as the default assumption.
Putting agents into production at this inflection point without understanding the harness is like driving on the highway without a seatbelt. You can go fast, but you’ll fly off the road at the first curve.
Timeline

What makes the three different
Prompt Engineering
Subject: A single prompt (input text)
Optimizing “what to ask AI and how.” Few-shot, Chain-of-Thought, ReAct. The art of maximizing accuracy in one exchange.
Context Engineering
Subject: Everything you feed the AI (system prompt + RAG + tool definitions + memory)
In Andrej Karpathy’s words, “it is a lot more than just the prompt itself.” As single-prompt approaches became insufficient in more cases, teams had to design the entire dynamically constructed context window.
Philipp Schmid (formerly Hugging Face, now Google DeepMind) argues that “the new skill for using AI isn’t prompting. It’s context engineering.”
Harness Engineering
Subject: The entire operating environment (context + constraints + tools + lifecycle + feedback + monitoring)
Louis Bouchard’s definition is the most concise:
Context Engineering is “what you send to the model.” Harness Engineering is “how the whole thing runs.”
Not the prompt, not the context. The environment around the model. If cooking is the analogy, the prompt is the recipe, the context is the ingredients, and the harness is the kitchen itself.
A nesting structure
These three aren’t competing concepts. They nest inside each other.

SmartScope’s article puts it cleanly:
Harness ⊇ Context ⊇ Prompt
Elephancube’s Japanese article uses an apt metaphor:
When you build a house, walls need a foundation, and a roof needs walls. Good prompts let context design work, and good context design lets the harness function.
“Replaced” or “layered”?
Here’s where interpretations diverge.
The “replaced” camp:
Data Science Dojo titled an article “Why Harness Engineering Is Replacing Prompt Engineering.” Their argument: agents in 2025–2026 operate in environments that prompts and context were never designed for.
The “layered” camp:
AnyTech (Medium) writes: “There’s no essential difference among the three; the terminology is shifting because LLMs and agents now handle a broader scope of work.” A reassuring take. You don’t have to throw out everything you knew each time a new buzzword arrives.
This book’s position: the layered camp. Prompt Engineering is still important. It’s just no longer sufficient on its own in a growing number of cases. Harness Engineering subsumes prompt and context, then adds an outer layer of constraints, lifecycle management, and feedback loops.
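The layering can be sketched as plain data types, where each layer wraps the one inside it and adds something of its own. This is only an illustration: the class and field names (`Prompt`, `Context`, `Harness`, `max_steps`, `no_todo`, and so on) are hypothetical and do not come from any of the sources cited here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Prompt:
    """Innermost layer: the text of a single request."""
    text: str


@dataclass
class Context:
    """Middle layer: the prompt plus everything else fed to the model."""
    prompt: Prompt
    system_prompt: str = ""
    retrieved_docs: List[str] = field(default_factory=list)
    tool_definitions: List[str] = field(default_factory=list)
    memory: List[str] = field(default_factory=list)


@dataclass
class Harness:
    """Outer layer: the context plus the rules that govern the whole run."""
    context: Context
    constraints: List[Callable[[str], bool]] = field(default_factory=list)
    max_steps: int = 50  # lifecycle limit (hypothetical default)
    feedback_log: List[str] = field(default_factory=list)

    def accept(self, output: str) -> bool:
        """Return True only if every constraint passes; record each failure."""
        ok = True
        for check in self.constraints:
            if not check(output):
                ok = False
                self.feedback_log.append(f"rejected by {check.__name__}")
        return ok


def no_todo(output: str) -> bool:
    """Example constraint: reject output that still contains a TODO."""
    return "TODO" not in output


harness = Harness(
    context=Context(prompt=Prompt("Add input validation to signup()")),
    constraints=[no_todo],
)
```

Nothing in the prompt or the context goes away. The harness holds them and adds the parts they cannot express on their own: constraints that can reject an output, a limit on how long the run may continue, and a log that feeds the next round of improvement.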
Why now?
A piece by WonderLab on DEV.to puts it well:
The timing isn’t a coincidence. In 2025, AI agents went from “cool demos” to “actual productivity tools.”
Once agents run autonomously for long stretches, optimizing one prompt can’t keep them under control. Context design alone is also insufficient. You have to design the whole environment.
That urgency is what gave birth to Harness Engineering.
03 Defining Harness Engineering
What “works in the demo, breaks in production” really means
harnessengineering.academy puts it this way:
Don’t deploy AI agents without a harness, the same way you wouldn’t run software directly on a CPU without an OS.
A CPU can compute. But without an OS, you can’t manage memory, schedule processes, or control I/O. Same for models: a model can generate text. But without a harness, you can’t manage context, control tools, or handle failures.
Nine times out of ten, an agent that “works in the demo, breaks in production” has a harness problem. To be specific:
- Demo: A controlled environment. Questions arrive in the expected flow. APIs work. Context is short.
- Production: Chaos. Unexpected inputs. APIs go down. Context blows up. Race conditions from parallel execution.
A harness is the cushion that absorbs production chaos. The demo is the showroom; production is the open road. Whether a car that runs perfectly in the showroom can survive the road is a separate question.
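One concrete piece of that cushion is a cap on retries and spend, the brake the 3 a.m. agent never had. The sketch below is a minimal illustration under assumptions: `RetryBudget`, `BudgetExceeded`, and the flat per-call cost are hypothetical names and simplifications, not an API from any vendor.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BudgetExceeded(RuntimeError):
    """Raised when the harness stops the agent instead of retrying again."""


class RetryBudget:
    """Caps both the attempts per call and the money spent across a run."""

    def __init__(self, max_attempts: int, max_cost_usd: float, cost_per_call_usd: float):
        self.max_attempts = max_attempts
        self.max_cost_usd = max_cost_usd
        self.cost_per_call_usd = cost_per_call_usd  # assumed flat price per call
        self.spent_usd = 0.0

    def call(self, fn: Callable[[], T], base_delay_s: float = 1.0,
             sleep: Callable[[float], None] = time.sleep) -> T:
        last_error = None
        for attempt in range(self.max_attempts):
            # Check the cost cap before every attempt, not only the first one.
            if self.spent_usd + self.cost_per_call_usd > self.max_cost_usd:
                raise BudgetExceeded(f"cost cap reached after ${self.spent_usd:.2f}")
            self.spent_usd += self.cost_per_call_usd
            try:
                return fn()
            except Exception as exc:
                last_error = exc
                if attempt < self.max_attempts - 1:
                    sleep(base_delay_s * 2 ** attempt)  # exponential backoff
        raise BudgetExceeded(f"gave up after {self.max_attempts} attempts") from last_error
```

The design choice that matters is that the limit lives outside the agent. The agent can still decide to “try again,” but the decision to stop is no longer its to make.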
Where the word “harness” comes from
NxCode spells out the etymology:
The term is borrowed from equestrian equipment. A horse is powerful and fast, but without reins, a saddle, and a bridle, it goes wherever it wants. The AI model is the horse. The harness is everything that channels that power into productive work.
A note post by kazu_t uses an OS-vs-application-code analogy:
If the prompt is application code, the harness is the OS.
Aakash Gupta (Medium) puts it even more simply:
The model is the engine. The harness is the car. The best engine in the world goes nowhere without steering and brakes.
Distinguishing it from “test harness”
Parallel.ai raises an important caveat:
Don’t confuse it with a test harness (an old term in software engineering). A test harness is a framework that feeds inputs and auto-checks outputs. An agent harness is the entire operating environment of an AI.
Search the term and you’ll get hits about electrical wiring and the CI/CD platform Harness.io. The harness in this book refers to the control environment for AI agents.
Comparing the definitions
Here are the definitions side by side.
OpenAI
“Humans steer. Agents execute. By deliberately imposing this constraint, we built what was needed to lift engineering speed by orders of magnitude.”
Harness = the environment in which agents reliably write code.
Anthropic
“Multi-context-window support, environment setup in the initial context, context management, sub-agent composition.”
Harness = a stable control system for long-running agents.
LangChain
“Agent = Model + Harness. The model has the intelligence; the harness makes that intelligence useful.”
Harness = the outer shell that converts model intelligence into useful work.
Martin Fowler
“Strongly typed languages turn type checks into sensors. Module boundaries provide architectural constraint rules. Frameworks like Spring abstract away details the agent doesn’t need to think about, implicitly raising the agent’s success rate.”
Harness = the total set of implicit and explicit constraints embedded in a codebase.
Louis Bouchard
“Stop saying ‘the model is dumb.’ Say instead, ‘my system tolerated this failure mode.’”
Harness = environment design that doesn’t tolerate failure modes.
What they all agree on
The wording differs, but everyone agrees on a few points.
- The harness is outside the model: this isn’t about tweaking model parameters
- Constraints are enforced, not requested: the system doesn’t move forward unless they’re satisfied
- Feedback loops are mandatory: evaluate outputs, keep improving the environment
- The human role changes: from writing code to designing the environment
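“Enforced, not requested” can be made concrete with a small gate. The sketch below assumes a Python project whose tests live in files named `test_*.py` or `*_test.py`. The function names and that naming convention are illustrative assumptions, and the list of changed files is expected to come from whatever hook mechanism you use.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def is_test_file(path: str) -> bool:
    """Assumed convention: tests live in files named test_*.py or *_test.py."""
    name = PurePosixPath(path).name
    return name.startswith("test_") or name.endswith("_test.py")


def untested_changes(changed_files: Iterable[str]) -> List[str]:
    """Return changed Python sources when no test file changed alongside them."""
    files = [f for f in changed_files if f.endswith(".py")]
    sources = [f for f in files if not is_test_file(f)]
    tests = [f for f in files if is_test_file(f)]
    return sources if sources and not tests else []


def gate(changed_files: Iterable[str]) -> int:
    """Exit-code style result: 0 lets the commit through, 1 blocks it."""
    offenders = untested_changes(changed_files)
    for path in offenders:
        print(f"blocked: {path} changed without any test change")
    return 1 if offenders else 0
```

A check like this is deliberately crude. It cannot tell whether the tests are good, only that the agent did not skip them, which is exactly the kind of rule a system can enforce without asking.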
This book’s working definition of Harness Engineering
Combining the definitions above, this book uses:
Harness Engineering is the discipline of designing the entire environment in which AI agents operate autonomously over long periods of time. It includes context management, constraint enforcement, lifecycle management, feedback loops, monitoring, and security boundaries.
What goes wrong without a harness
The value of a harness becomes clear when you look at what fails without one.
| Problem | Without a harness | With a harness |
|---|---|---|
| Code style consistency | Agent writes in a different style every time | Linter hook auto-unifies |
| Test creation | Have to ask “please write a test” every time | Pre-commit blocks untested commits |
| Handling secrets | Agent embeds API keys in code | Security boundary detects and rejects |
| Long-running tasks | Context bloats, quality drops | Context resets + progress files |
| Reproducible quality | Depends on whoever’s working (human or AI) | Guaranteed by the environment |
A harness turns “asks” into “mechanisms.” Saying “please write tests” 100 times is less reliable than building the system once so commits without tests can’t happen. Same as training junior team members.
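The “Handling secrets” row is a good example of a mechanism that is cheap to build. The sketch below scans text for a few obvious secret shapes and raises instead of warning. The three patterns are illustrative assumptions, not a complete rule set, and `scan` and `reject_if_secret` are hypothetical names.

```python
import re
from typing import Dict, List, Tuple

# Illustrative patterns only; a real scanner would use a maintained rule set.
SECRET_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "hardcoded_api_key": re.compile(
        r"(?i)(api[_-]?key|secret|token)\w*\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]"
    ),
}


def scan(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a rule."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings


def reject_if_secret(text: str) -> None:
    """Raise instead of warning, so the agent's change cannot proceed."""
    findings = scan(text)
    if findings:
        details = ", ".join(f"line {n}: {rule}" for n, rule in findings)
        raise ValueError(f"secret detected ({details})")
```

Reading a key from the environment, as in `os.environ['API_KEY']`, passes this check. Pasting the key itself into the file does not.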
The decisive difference from Prompt Engineering
Prompt Engineering optimizes “one exchange.” Harness Engineering optimizes “100 exchanges.”
For a single exchange, a good prompt is enough. But when an agent codes all day, the effect of the first prompt has faded by the 50th. Context bloats, the original instructions slip into the distant past, and the agent starts behaving differently than it did at the start.
A harness solves that. If a prompt is “the first push,” a harness is “the gravity that’s always pulling.”
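One way to picture that “gravity” is a context that pins the original instructions at the top of every window, and that writes progress to a file and starts fresh once the window fills up. This is a minimal sketch under assumptions: `ResettableContext` is a hypothetical class, the four-characters-per-token estimate is a crude stand-in for a real tokenizer, and the progress file format is made up for the example.

```python
import json
from pathlib import Path
from typing import List


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


class ResettableContext:
    """Keeps pinned instructions in every window and resets when it fills up."""

    def __init__(self, pinned: str, progress_path: Path, max_tokens: int):
        self.pinned = pinned
        self.progress_path = progress_path
        self.max_tokens = max_tokens
        self.messages: List[str] = []

    def used_tokens(self) -> int:
        return estimate_tokens(self.pinned) + sum(estimate_tokens(m) for m in self.messages)

    def add(self, message: str, done_so_far: List[str]) -> bool:
        """Append a message; return True if this triggered a reset."""
        self.messages.append(message)
        if self.used_tokens() <= self.max_tokens:
            return False
        # Persist progress outside the context, then start a fresh window.
        self.progress_path.write_text(json.dumps({"done": done_so_far}, indent=2))
        self.messages = [f"Progress so far is recorded in {self.progress_path.name}."]
        return True

    def render(self) -> str:
        """The pinned instructions always come first, however long the run is."""
        return "\n".join([self.pinned, *self.messages])
```

By the 50th exchange the conversation history may be gone, but the pinned instructions are at the top of the window exactly as they were at the start.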
From the next chapter, we examine each player’s interpretation one by one.
Overview
Harness Engineering, mapped across the 5 interpretations from OpenAI, Anthropic, LangChain, Martin Fowler, and academia. The first systematic guide that distills the 6 building blocks, the AGENTS.md/CLAUDE.md/hooks implementation patterns, and Self-Evolving Agents — the practical reference for the 2026 keyword.
What you will be able to do
- Decompose any harness into the 6 building blocks framework
- Choose between AGENTS.md, CLAUDE.md, and hooks for each task
- Compare interpretations from OpenAI Codex, Anthropic, LangChain, Martin Fowler, and academia in one place
- Implement Self-Evolving Agent patterns (self-improving harness)
- Place tools like Vibe Coding, Spec Coding, and Agent Frameworks on a clear technology map
Who is this book for
- [AI Agent Developer] Want the systematic view of harness as the 2026 keyword
- [Claude Code User] Ready for the layer above CLAUDE.md
- [Tech Lead] Designing AI agent ops across an entire team
- [Researcher] Comparing OpenAI, Anthropic, and LangChain interpretations side-by-side
- [Self-Evolving Curious] Looking to build self-improving agents
- [Tool Picker] Mapping Vibe Coding, Spec Coding, and Agent Frameworks
Problems this book solves
- I hear 'Harness Engineering' a lot but can't actually explain what it is
- OpenAI and Anthropic seem to define it differently
- The line between AGENTS.md and CLAUDE.md feels blurry
- I don't know when to reach for hooks
- Self-Evolving Agent design patterns aren't clear to me
- The boundary between harness and Agent Frameworks (LangChain etc.) is murky
Where this book stands
- Cross-vendor (5 interpretations compared in one book — first of its kind)
- Implementation-focused (not just theory — concrete AGENTS.md / hooks examples)
- Intermediate to advanced (Claude Code / CLAUDE.md basics assumed)
- Harness-specific (single topic, 19 chapters of depth)
Why this book
- First book to integrate the 5 interpretations from OpenAI, Anthropic, LangChain, Martin Fowler, and academia
- Six-building-block framework for systematizing 'what is harness?'
- Goes all the way to Self-Evolving Agents (self-improving harness) and future predictions
- Real implementation patterns for AGENTS.md / CLAUDE.md / hooks with concrete examples
- Built on a Zenn article that drew 12,000 views — this is the full-fledged version
How this differs from other AI books
| Compared to | This book's difference |
|---|---|
| Vendor docs (OpenAI / Anthropic / LangChain) | Not a single-vendor view. This book integrates 5 interpretations and explains why they disagree. |
| Prompt / Context Engineering books | Tackles the layer above prompt and context — the third tier of the stack. |
| Agent Framework guides (LangChain Agents etc.) | Not framework-specific. Maps the boundary between harness and Agent Frameworks. |
Table of contents
- 01 Preface — Why 'Harness' now (Free preview)
- 1-1 A Tuesday at 3 a.m.
- 1-2 Who this book is for
- 1-3 How to read this book
- 02 The Three Engineerings (Prompt → Context → Harness) (Free preview)
- 2-1 Why 40% fail
- 2-2 Timeline
- 2-3 What makes the three different
- 2-4 A nesting structure
- 2-5 "Replaced" or "layered"?
- 2-6 Why now?
- 03 Harness Engineering: Definition and Big Picture (Free preview)
- 3-1 What "works in the demo, breaks in production" really means
- 3-2 Where the word "harness" comes from
- 3-3 Distinguishing it from "test harness"
- 3-4 Comparing the definitions
- 3-5 What they all agree on
- 3-6 This book's working definition of Harness Engineering
- 3-7 What goes wrong without a harness
- 3-8 The decisive difference from Prompt Engineering
- 04 OpenAI's Take — Codex and the million-line experiment
- 05 Anthropic's Take — Harness for long-running agents
- 06 LangChain's Take — Agent = Model + Harness
- 07 Martin Fowler's View — The implicit harness in every codebase
- 08 The Academic View — arXiv papers and formal specification
- 09 The Six Building Blocks — Anatomy of a harness (Free preview)
- 10 Technology Map — Vibe Coding / Spec Coding / Agent Framework
- 11 Reconciling the Differences — What everyone agrees and disagrees on
- 12 AGENTS.md / CLAUDE.md Practical Design
- 13 Hooks / Lifecycle / Feedback Loops
- 14 Self-Evolving Agent — A harness that improves itself
- 15 The Future of Harness Engineering
- 16 Afterword
- 17 References (Free preview)
- 18 About the Author (Free preview)
- 19 Colophon (Free preview)
The phrase Harness Engineering is everywhere, and means something different to everyone. OpenAI talks about scaling Codex. Anthropic talks about long-running agents. LangChain frames it as Agent = Model + Harness. Martin Fowler points out that every codebase already has an implicit harness.
Each of them is right. But until now, no book has stitched these views into a single system.
This book maps what a harness is, how to design one, and how to operate it. It synthesizes the 5 interpretations into 6 building blocks, then walks through implementation with AGENTS.md, CLAUDE.md, and hooks, all the way to Self-Evolving Agents.
“Prompt was 2024. Context was 2025. Harness is 2026.”
* This page contains Amazon Associates links. Purchases may earn the author a referral fee.