Part of The Anatomy of Agentic Coding Systems series

The Harness Layer: Why the Wrapper Matters More Than the Model

By John Davenport · Published on March 31, 2026

The harness layer is the deterministic constraint system surrounding an AI coding agent — the specs, verification loops, lifecycle controls, and architectural enforcement that sit between the model's output and your codebase. It's what turns probabilistic token generation into reliable software.

Part 4 of "The Anatomy of Agentic Coding Systems," a series breaking down how AI coding tools actually work.


Three engineers. Five months. Zero manually written lines of source code. Roughly one million lines of production code shipped.

That's the headline from OpenAI's internal Codex experiment. But the interesting part isn't the output. It's what those engineers were actually doing. They weren't writing code. They weren't even reviewing most of it. They were building the harness: the constraints, verification loops, and architectural enforcement that allowed agents to produce reliable code at scale.

The team averaged 3.5 merged PRs per engineer per day. When they expanded to seven engineers, throughput increased. Not because they had more hands on keyboards, but because each new engineer added more constraints, better linting rules, tighter feedback loops. More harness meant more reliable output.

Most developers are optimizing the wrong layer.

The wrong layer

When AI coding tools produce bad output, the instinct is to blame the model. "Claude hallucinated." "GPT made a mistake." So you switch models. You upgrade. You wait for the next benchmark winner.

But the model is rarely the bottleneck. The same model that produces garbage in a bare chat window can produce reliable, architecturally consistent code when wrapped in the right harness. The difference isn't intelligence. It's structure.

LangChain made the formula explicit in early 2026: Agent = Model + Harness. They demonstrated a 13.7-point jump on Terminal Bench 2.0 (52.8 → 66.5) from harness changes alone: same model, same task, different surrounding system. At the 2026 frontier, capability is a question of the surrounding system more than of the weights.

What counts as a harness

Before this becomes a buzzword war, the word has to mean something. A harness is the persistent, deterministic structure around an AI coding agent. Three properties separate it from things adjacent to it:

  1. It has at least one loop with an oracle. Something outside the model decides whether the work is done. Tests, specs, linters, a stress tester, a CI gate. No oracle, not a harness.
  2. It persists. A harness outlives any single prompt. It lives on disk, in the CI config, in the hook scripts, in the spec graph. If it dies when the session ends, it was a prompt.
  3. It enforces rather than suggests. Hooks block. CI rejects. Linters fail the build. Documents the model can ignore are hints, not enforcement.

This rules out a long list of things that get marketed as harnesses:

  • A rules file alone (CLAUDE.md, AGENTS.md, .cursorrules) — has persistence, no oracle, no enforcement. The model can ignore every word. Birgitta Böckeler classifies these as guides — one component of a harness, not a harness on their own.
  • A skill (the Anthropic SKILL.md pattern) — a capability module the agent loads on demand. Composes into a harness; not one alone.
  • An MCP server — extends the agent's tool surface. Tools are a layer of a harness, but tools without enforcement or feedback are just a bigger API.
  • A one-shot generator that takes a prompt and emits a project. No loop, no persistent oracle, no enforcement. A code generator with a marketing problem.

The boundary test: if the agent ignored the artifact, would anything stop or correct it? If no, it's not a harness — it's a hint.
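The first property, a loop with an oracle, can be sketched in a few lines. This is a minimal illustration, not any tool's implementation: `propose` stands in for the agent, `oracle` stands in for tests or a linter, and every name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    ok: bool
    feedback: str = ""


def run_harness_loop(
    propose: Callable[[str], str],
    oracle: Callable[[str], Verdict],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first candidate the oracle accepts, or None if none pass."""
    feedback = ""
    for _ in range(max_attempts):
        # The agent sees the task plus whatever the oracle said last time.
        candidate = propose(task + feedback)
        verdict = oracle(candidate)
        if verdict.ok:
            return candidate
        feedback = "\nPrevious attempt rejected: " + verdict.feedback
    return None  # enforcement: nothing is returned without the oracle's approval


# Toy stand-ins: this "agent" only gets it right after seeing a rejection.
def toy_agent(prompt: str) -> str:
    if "rejected" in prompt:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def toy_oracle(source: str) -> Verdict:
    namespace: dict = {}
    exec(source, namespace)  # fine for a toy; a real oracle would run a test suite
    result = namespace["add"](2, 3)
    return Verdict(ok=result == 5, feedback=f"add(2, 3) returned {result}, expected 5")


accepted = run_harness_loop(toy_agent, toy_oracle, "Write add(a, b).")
```

The point of the shape is that the decision to stop lives in `oracle`, outside the model. If `toy_agent` ignored the feedback, the loop would return `None` rather than a wrong answer.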

The shapes a harness comes in

Once you accept the formal definition, the next observation is that the term covers at least six structurally distinct shapes. They differ on what loop runs, what the oracle is, and how each one fails when it drifts.

  • Spec-driven inner loop. Intent → spec → plan → tasks → code → tests → verification, each gated on the previous. Oracle is the spec plus tests generated from it. Examples: GitHub Spec Kit, OpenSpec, Kiro, BMAD-Method, Tessl, and CodeMySpec. Common failure: the codebase satisfies the spec but does the wrong thing in production — spec-without-signal.
  • Signal-driven outer loop. Run the system → capture evidence → diagnose → patch → rerun. Oracle is a runtime signal: stress tester, fuzzer, integration suite, telemetry. Examples: jbc22's stress-test loop from r/ClaudeAI, Karpathy's autoresearch for ML experiments. Common failure: the agent optimises for the signal rather than the goal — Goodhart's law restated for agents.
  • Hybrid (spec + signal). Spec-driven inner loop with a signal-driven outer loop on top that can re-enter the spec phase when runtime evidence contradicts it. OpenAI's Codex internal harness sits closest to this shape — architectural enforcement plus production signal. No mature public example of a clean hybrid yet, which is why this is the most interesting empty quadrant in the space.
  • Hook-based / policy enforcement. The harness doesn't run its own loop — it intercepts the agent's. Oracle is a deterministic rule (TDD discipline, file ownership, architectural boundary). Example: TDD-Guard blocks Write/Edit on Claude Code if there's no failing test for the change. Common failure: rules ossify, agent learns to route around them.
  • Evidence / eval harness. The original sense. EleutherAI's lm-evaluation-harness (2020) shipped as "a framework for few-shot evaluation of language models." HumanEval, SWE-bench, Terminal Bench 2.0 inherit the same shape. The model is the subject under test, not the operator. Common failure: benchmark gaming.
  • Rules-only. The agent reads a rules file and decides whether to follow it. No oracle, no enforcement. The Karpathy CLAUDE.md distillation at 100K+ stars is the most prominent example, and the cleanest illustration of the rules-as-harness trap. The model can ignore every word; nothing detects the lapse. It's a useful guide paired with other layers, not a harness on its own.
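To make the "each gated on the previous" idea in the spec-driven shape concrete, here is a small sketch. The stage names follow the list above, but the `run_gated` function and the artifact keys are hypothetical and not taken from any of the tools named.

```python
from typing import Callable

Stage = tuple[str, Callable[[dict], bool]]


def run_gated(stages: list[Stage], artifacts: dict) -> list[str]:
    """Run stages in order and stop at the first gate that fails."""
    completed = []
    for name, gate in stages:
        if not gate(artifacts):
            break  # later stages never run on an unverified input
        completed.append(name)
    return completed


stages: list[Stage] = [
    ("spec", lambda a: bool(a.get("spec"))),
    ("plan", lambda a: bool(a.get("plan"))),
    ("code", lambda a: bool(a.get("code"))),
    ("tests", lambda a: a.get("tests_passed") is True),
]

# Code exists but its tests fail, so the pipeline stops short of "tests".
done = run_gated(
    stages,
    {"spec": "...", "plan": "...", "code": "...", "tests_passed": False},
)
```

A rules-only setup has no equivalent of the `break`: nothing stops the next stage when the previous one was skipped or failed.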

The two lineages of the word are visible here too: the eval-harness sense (the fifth shape above) is the etymological root, and every codegen harness eventually re-invents a private version of it to validate its own changes. The current agent-scaffolding sense (the first four shapes) is what Mitchell Hashimoto named when he wrote "Engineer the Harness" in February 2026: "anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again." That's the discipline definition. The taxonomy above is the architectural one.

Three evolutions

The industry figured this out in three phases, each building on the last:

Prompt engineering (2022-2024) taught us to communicate with models. Craft better instructions, get better output. But prompts are suggestions. The model can ignore them. There's no enforcement, just hope.

Context engineering (2025) taught us to inform models. Andrej Karpathy championed this: what the model sees matters more than how you ask. Instead of hoping it remembers your coding standards, inject them into context. Real step forward, but the model can still ignore what you show it.

Harness engineering (2026) teaches us to constrain models. A custom linter doesn't ask the model to follow the architecture; it rejects code that violates it. A test suite doesn't hope for correct logic; it verifies it. As OpenAI's Codex team put it: "Not documented. Enforced."

Each evolution subsumed the previous one. A good harness includes good context engineering, which includes good prompting. But the harness adds what prompting and context never could: deterministic enforcement.

Rules and constraints

Every major AI coding tool now supports project-level rules: CLAUDE.md for Claude Code, .cursor/rules/ for Cursor, AGENTS.md for GitHub Copilot. These files inject instructions into the model's context. Böckeler categorises them as guides: feedforward signal that shapes what the agent intends to do. A guide is one input to a harness, not the harness itself. The complement is what she calls sensors: feedback loops (linters, tests, hooks, browser verification) that catch what actually happened. Guides without sensors are wallpaper.

The Karpathy CLAUDE.md at 100K+ stars is the cleanest public example of the rules-as-harness trap. Four good principles, persistent in the repo, ignored by the agent whenever the cost of compliance exceeds the cost of looking compliant. By contrast, Hashimoto's AGENTS.md for the Ghostty project works — but only because he pairs it with custom scripts and the discipline of rerunning the agent against documented failures. His own framing: AGENTS.md alone is not the harness. The practice around it is.

OpenAI's Codex team went far beyond config files. They implemented a strict layered architecture with a directional dependency chain:

Types -> Config -> Repo -> Service -> Runtime -> UI

Code could only depend "forward": each layer may import from the layers before it in the chain, never from the ones after it. A Service could import from Repo, but Repo could never import from Service. And these constraints weren't documented in a README the agent might not read. They were enforced mechanically:

  • Custom linters (themselves generated by agents) that flagged violations
  • Structural tests validating dependency directions at build time
  • CI jobs blocking PRs that violated layer boundaries
  • Error messages with remediation guidance baked in

The agents couldn't violate the architecture even if they wanted to. That's the philosophical core of harness engineering: models are probabilistic, rules are deterministic. Use deterministic constraints to bound the probabilistic output.
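A structural test of this kind might look roughly like the sketch below. It is not OpenAI's implementation. It assumes a hypothetical `app.<layer>.<module>` package layout, the names `layer_of` and `find_violations` are invented for illustration, and relative imports are ignored for brevity.

```python
import ast

# Layer order from the chain above; a module may import from its own
# layer or any earlier one, never from a later one.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}


def layer_of(module: str):
    """Map 'app.service.billing' to 'service'; None for anything else."""
    parts = module.split(".")
    if len(parts) >= 2 and parts[0] == "app" and parts[1] in RANK:
        return parts[1]
    return None


def find_violations(module: str, source: str) -> list[str]:
    """List imports in `source` that reach into a later layer than `module`'s."""
    own = layer_of(module)
    if own is None:
        return []
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        else:
            continue
        for target in targets:
            dep = layer_of(target)
            if dep is not None and RANK[dep] > RANK[own]:
                # Remediation guidance is part of the error message.
                violations.append(
                    f"{module} ({own}) imports {target} ({dep}); "
                    f"move the shared code into '{own}' or an earlier layer"
                )
    return violations


repo_source = "from app.service import billing\nimport app.types.user\n"
service_source = "from app.repo import accounts\n"

repo_violations = find_violations("app.repo.accounts", repo_source)
service_violations = find_violations("app.service.billing", service_source)
```

Here `repo_violations` holds one entry (Repo reaching into Service) and `service_violations` is empty. Wired into CI as a failing test, a check like this turns the dependency chain from documentation into a gate.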

Verification: the missing piece

When OpenAI published their harness engineering article, Birgitta Böckeler, writing on Martin Fowler's site, offered a pointed response. She praised the framing but identified something missing: the write-up focused on internal code quality and maintainability but lacked "verification of functionality and behaviour."

The code was architecturally clean, well-documented, structurally sound. But did it actually work?

This is the gap that matters most. An agent can produce beautifully structured code that compiles, passes linting, follows the dependency chain, and is completely wrong about what it's supposed to do.

Anthropic's research on long-running agents tackled this directly. They discovered that Claude would mark features as "complete" without verifying they actually worked, unless the harness explicitly required browser-based end-to-end testing. Once they added browser automation tools, "the agent was able to identify and fix bugs that weren't obvious from the code alone."

The key finding: agents take the path of least resistance. If your harness only requires unit tests, that's what you get. If it requires end-to-end browser verification, the agent actually opens a browser and checks. The verification standard you set is the verification standard you get.

They also had to add "strongly-worded instructions" like "It is unacceptable to remove or edit tests" to prevent agents from weakening the test suite to make their code pass. Without this, agents would modify tests to match buggy implementations rather than fixing the implementation.

In practice, this means tools like Claude Code's hooks system matter a lot. Hooks fire at lifecycle points (before tool use, after tool use, on stop) and can run linters, test suites, security scans, even spawn sub-agents for verification, all automatically, before the agent moves on.
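As an illustration, a pre-tool-use hook that enforces "do not edit tests" mechanically could be sketched as follows. Treat the details as assumptions to check against the Claude Code hooks reference: the sketch assumes the hook receives a JSON event with `tool_name` and `tool_input.file_path`, and that exit code 2 blocks the call. The protected directory names are hypothetical.

```python
import json
import sys
from pathlib import PurePosixPath

PROTECTED_DIRS = {"tests", "test", "spec"}  # hypothetical project layout
GUARDED_TOOLS = {"Write", "Edit"}
ALLOW = 0
BLOCK = 2  # assumed "block this tool call" exit code


def decide(payload: dict) -> tuple[int, str]:
    """Return (exit_code, message) for one pre-tool-use event."""
    if payload.get("tool_name") not in GUARDED_TOOLS:
        return ALLOW, ""
    path = payload.get("tool_input", {}).get("file_path", "")
    if PROTECTED_DIRS & set(PurePosixPath(path).parts):
        return BLOCK, f"Refusing to modify {path}: fix the implementation, not the test."
    return ALLOW, ""


def main() -> int:
    """Entry point when installed as a hook: event JSON on stdin, reason on stderr."""
    code, message = decide(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    return code


blocked = decide({"tool_name": "Edit", "tool_input": {"file_path": "tests/test_cart.py"}})
allowed = decide({"tool_name": "Edit", "tool_input": {"file_path": "src/cart.py"}})
```

To use it as a script, the file would end with `raise SystemExit(main())`. Note what the sketch does not cover: a shell command that deletes the test directory passes straight through, and a real policy would probably still allow new test files to be created.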

Lifecycle: working across sessions

Every AI coding tool has a context window limit. When a task exceeds it, information is lost. The harness is what survives these transitions.

Anthropic formalized this with a two-agent architecture. An initializer agent sets up the project: environment scripts, progress tracking files, a comprehensive feature list (200+ features, all initially marked "failing"). Then a coding agent runs in subsequent sessions, reading the progress file and git history to understand current state, picking up the next feature, implementing it, testing it, and updating the progress file before exiting.

The inspiration was the "shift handoff" model from engineering teams. Each new session arrives to organized, well-documented code rather than chaos.

Git serves double duty here: version control for the code and checkpoint/recovery for the agent. Agents commit when a step succeeds, not when a feature is complete. The harness uses git as an undo mechanism. Progress files, feature lists, and architecture docs create a layer of state that sits alongside the code but serves the agent. As LangChain puts it, the filesystem is "arguably the most foundational harness primitive" because it lets agents maintain state that outlasts a single session.
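The progress-file half of that state can be sketched with nothing but the filesystem. The file name, JSON shape, and function names below are hypothetical, not Anthropic's format. The git checkpoint is left as a comment, not executed.

```python
import json
import tempfile
from pathlib import Path


def init_progress(path: Path, features: list[str]) -> None:
    """Initializer session: record every feature as failing."""
    state = [{"name": name, "status": "failing"} for name in features]
    path.write_text(json.dumps(state, indent=2))


def next_feature(path: Path):
    """Coding session: pick the first feature still marked failing, or None."""
    for item in json.loads(path.read_text()):
        if item["status"] == "failing":
            return item["name"]
    return None


def mark_passing(path: Path, name: str) -> None:
    """Call only after the feature's verification has actually passed."""
    state = json.loads(path.read_text())
    for item in state:
        if item["name"] == name:
            item["status"] = "passing"
    path.write_text(json.dumps(state, indent=2))
    # A real harness would commit here so the step becomes a recovery point.


with tempfile.TemporaryDirectory() as tmp:
    progress = Path(tmp) / "progress.json"
    init_progress(progress, ["login", "checkout"])

    first = next_feature(progress)    # session 1 picks up "login"
    mark_passing(progress, first)

    second = next_feature(progress)   # session 2 starts cold, reads the file
    mark_passing(progress, second)

    remaining = next_feature(progress)
```

Each call to `next_feature` reads only from disk, which is the point: a new session with an empty context window can still find out where the previous one stopped.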

Harnesses are fractal

Here's the insight that ties everything together: harnesses compose. They stack.

Each level adds constraints and verification. A developer using Claude Code with no CLAUDE.md is operating at Level 1. A team with a complete harness stack is at Level 3 or 4. Both are "using Claude," but they're using fundamentally different systems.

The legacy codebase problem

Böckeler raised an important caveat: everything OpenAI described was greenfield. The constraints were there from day one. What about a ten-year-old codebase with no architectural rules, inconsistent testing, and patchy docs? She compared it to running a static analysis tool on a codebase that's never had one, and drowning in alerts.

This is real. But the approach is incremental: start with what you have (pre-commit hooks, existing linters, whatever CI checks already run). Pick one boundary that matters and enforce it. Use the agent to build the harness: the meta-move of generating linters and structural tests that will constrain the agent in future sessions. And accept that it's iterative. OpenAI spent five months building their harness.

Why this layer matters most

HumanLayer put it simply: "Agents aren't hard; the harness is hard." The model intelligence is a commodity. The harness is the differentiator. It's the difference between a demo that impresses and a system that ships.

If you're using an AI coding tool today and you haven't configured its rules file, set up verification hooks, or structured your project for agent-friendly navigation, you're operating at Level 1 in a world where Level 3 is available. The model isn't your bottleneck. The harness is.

Where CodeMySpec sits, honestly

CodeMySpec is a strong spec-driven inner-loop harness with hook-based policy enforcement at the gates. Same family as Spec Kit, OpenSpec, Kiro, BMAD, and Tessl. It differentiates within that family on a single requirement graph (stories, scenarios, architecture, module specs, code, tests, and verification all on one DAG), mandatory BDD scenarios with optional module specs, and Phoenix-specific architectural enforcement via the boundary library.

The outer-loop capture primitive is already in place. Agents call create_issue via an MCP tool when they hit a problem they can't resolve cleanly — undocumented config, looping hooks, contradictory prompts, tool errors. Users file feedback into the same queue from inside their app. Issues carry severity, a status lifecycle (incoming → accepted → resolved / dismissed), and optionally a story_id that pins them to a node in the spec graph. Issues scoped framework route back to CodeMySpec itself, so agents can file bugs against the harness without leaving the loop. The capture surface, the queue, and the triage state machine exist today.
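The issue lifecycle described above is a small state machine. The sketch below is illustrative only and is not CodeMySpec's code. In particular, the transition table is an assumption (the text names the four states but not every allowed move), and the `Issue` fields other than severity, status, and `story_id` are invented.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed transitions: triage can accept or dismiss, and accepted work
# can be resolved or dismissed. Terminal states allow no further moves.
TRANSITIONS = {
    "incoming": {"accepted", "dismissed"},
    "accepted": {"resolved", "dismissed"},
    "resolved": set(),
    "dismissed": set(),
}


@dataclass
class Issue:
    title: str
    severity: str
    story_id: Optional[str] = None  # optional pin to a node in the spec graph
    status: str = "incoming"

    def move_to(self, new_status: str) -> None:
        """Apply a status change, rejecting moves the table does not allow."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"cannot move from {self.status} to {new_status}")
        self.status = new_status


issue = Issue("Hook loops on format step", severity="high", story_id="story-42")
issue.move_to("accepted")
issue.move_to("resolved")
```

Keeping the table explicit means an agent filing or triaging issues cannot put one into a state the queue does not recognise. The invalid move fails loudly instead.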

What's structurally next: the automated detection layer and the formalised closure. The detection layer is an agent that watches prod crashes and error rates and files issues without a human in the middle. The closure is the handoff from a triaged issue back into spec mutation — when the queue says "the deployed system is wrong," the harness re-enters the spec phase, mutates the relevant story or scenario, regenerates the affected tasks, and patches. Both are on the roadmap. Both make the harness genuinely hybrid in Böckeler's sense.

The strategic shape is that every report-back surface — agent failure, user feedback, prod incident — routes into the same issues queue, and the spec graph reads from it. Issue-centric by design. The inner loop is real and load-bearing today. The outer-loop capture is in place. The closure is the next horizon. Naming the gap is what makes the inner loop defensible — and the issues pipeline is what makes the hybrid claim a roadmap rather than a wish.


This is Part 4 of "The Anatomy of Agentic Coding Systems." Part 3 covered the Agent Layer - the execution loop, tool use, and context management. Part 5 covers the Environment Layer - where AI code actually runs.


Sources

  1. Harness Engineering - OpenAI
  2. Unlocking the Codex Harness - OpenAI
  3. My AI Adoption Journey ("Engineer the Harness") - Mitchell Hashimoto
  4. Effective Harnesses for Long-Running Agents - Anthropic
  5. Harness Design for Long-Running Application Development - Anthropic
  6. Harness engineering for coding agent users - Birgitta Böckeler, Martin Fowler's site (guides/sensors taxonomy)
  7. Improving Deep Agents with Harness Engineering - LangChain (Agent = Model + Harness)
  8. The Anatomy of an Agent Harness - LangChain
  9. Skill Issue: Harness Engineering for Coding Agents - HumanLayer
  10. The Third Evolution - Epsilla
  11. OpenAI's Agent-First Codebase Learnings - Alex Lavaee
  12. Hooks Reference - Claude Code
  13. Rules - Cursor
  14. Context Engineering - Simon Willison (on Karpathy)
  15. lm-evaluation-harness - EleutherAI (the etymological root)
  16. Terminal Bench 2.0 - the eval substrate behind the LangChain 13.7-point harness jump
  17. GitHub Spec Kit - spec-driven inner-loop harness
  18. OpenSpec - low-ceremony spec-driven inner-loop harness
  19. Kiro - AWS spec-driven IDE harness
  20. BMAD-Method - multi-agent spec-driven workflow
  21. Tessl - spec-as-primary-asset harness
  22. TDD-Guard - hook-based policy enforcement on Claude Code
  23. autoresearch - Karpathy's signal-driven outer loop for ML experiments
  24. Agentic Stress Testing and Code-Fixer Feedback Loop - jbc22 on r/ClaudeAI (the canonical small signal-driven outer loop)
  25. Karpathy CLAUDE.md (the rules-only anti-example)