Agentic Development · 11 min read

Harness Engineering: The Third Pillar of Agentic Coding

TL;DR

The community’s framing, popularised by LangChain and picked up by Fowler: Agent = Model + Harness. Agentic coding is software development done by those agents. The model reasons. The harness is everything around it: the runtime, the tools, the rules, the skills, the hooks, the memory, the scheduled agents. Harness engineering is the discipline of building that “everything around it” well. Most teams invest in the model, a bit in the runtime, and almost nothing in the rest. That gap is where reliability lives, and where this article is aimed.


[Figure: The three pillars of agentic coding. Model and Runtime are off the shelf and increasingly commoditised; the third pillar, Harness Engineering (identity, rules, skills, hooks, memory, cron), is built by you, version-controlled, and compounds: your IP.]

The three pillars

The community’s standard framing, popularised by LangChain and extended by Fowler, is: Agent = Model + Harness. In the context of software development, those agents are coding agents: they plan, write, test, and iterate on code using development tools. The model is the reasoning engine. The harness is everything else: the runtime that wraps it, the tools it can call, the context it reads, the scripts that run around it.

For the sake of a useful picture, I’ll split the harness into two pillars. There’s the part you pick off the shelf, and the part you build yourself:

  • Model: Opus, GPT, Gemini. The reasoning engine. You pick one; you swap it when something better lands. Increasingly commoditised.
  • Runtime: Claude Code, Cursor, Codex, Aider, Cline. The harness you install: how the model uses tools, what files it can see, how the chat loop behaves, how it plugs into your IDE. Also increasingly commoditised.
  • Harness engineering: the harness you build. Identity, rules, skills, hooks, memory, scheduled agents. Version-controlled, reviewed, specific to your codebase and your team. Your intellectual property. The thing that doesn’t commoditise, and the thing most teams barely start on.

(Purists will note the runtime is technically part of the harness too, and they’re right. I’m splitting it visually because that’s where teams actually invest differently: most do the first two well and skip the third.)

A well-engineered harness is the difference between repeating yourself dozens of times and working with someone who already knows your project and your coding standards. The model doesn’t change. The runtime doesn’t change. What changes is everything the model reads before it starts.

The rest of this article is about the third pillar. The layout I’ve landed on: six layers, what each one is for, and a Monday-morning checklist.


Why harness engineering matters more than I first thought

A few things I keep running into, in my own work and when I look at how other people have set things up:

1. Most “the AI is bad at our codebase” problems are harness problems.

Whenever an agent setup is missing the mark, the first place I look is the AGENTS.md file (or CLAUDE.md). It’s almost always empty, missing, or a two-line placeholder nobody has touched in months. The model isn’t bad at the codebase; nobody told it what the codebase is, who the team is, which tests to run, or which parts are off-limits. We wouldn’t onboard a human engineer by pointing them at a monorepo and saying “figure it out.” But that’s exactly the default experience with an AI agent, and then we’re surprised when it underperforms.

2. Prompt quality has a ceiling. A good harness compounds.

A perfectly crafted prompt buys you one good turn. A well-engineered harness buys you every turn, forever, across the whole team. Harness changes version like code: someone proposes, someone reviews, someone merges. Prompts fade into chat history.

3. The harness is where your team’s judgment actually lives.

“We always run migrations behind a feature flag.” “All PR titles start with the Jira ticket.” “Never edit the generated protobuf files.” “Auth changes require a security review.” Every team has dozens of these. They usually live in one or two people’s heads, and new hires learn them by breaking them. A well-engineered harness is where those rules become version-controlled files that the model reads before every change, which also, incidentally, makes them discoverable for humans.


The anatomy of a well-engineered harness: six layers

[Figure: The six layers of harness engineering.]

| # | Layer | Role | Path | Loaded |
|---|---|---|---|---|
| 01 | Identity | who the agent is, how the system is organised | AGENTS.md | always |
| 02 | Rules | must never / must always, regardless of task | .claude/rules/ | always |
| 03 | Skills | callable workflows the agent invokes by name | .claude/skills/ | on demand |
| 04 | Hooks | deterministic scripts at lifecycle points | .claude/hooks/ | on demand |
| 05 | Memory | curated + searchable knowledge | memory/ | on demand |
| 06 | Scheduled agents | cron runs without a human in the loop | cron/ | on demand |

Any harness worth the name has six layers. The names and file paths vary between runtimes (Claude Code, Cursor, Codex, Aider; most of them now read the AGENTS.md open standard), but the roles are universal.

Layer 1: Identity

What it is: A single always-loaded file that tells the agent who it is, who it’s working for, and how the system is organised.

What goes in it:

  • The system’s purpose in two sentences
  • Where to find everything else (links to the other five layers)
  • Top-level directory map
  • How to operate: search memory before answering, check for existing workflows before starting tasks, read errors before retrying

What stays out of it: anything that changes more than monthly. Project-specific context, client details, recent decisions. Those belong elsewhere. If your identity file grows past ~200 lines you’re putting the wrong things in it.

Template:

# AGENTS.md (Claude Code reads CLAUDE.md, the rest read AGENTS.md)

## Identity
You are the engineering agent for <team>, a <stack> codebase focused on <domain>.
Follow <voice-or-tone-file>.md at all times.

## Architecture
- <component-1>: purpose, location
- <component-2>: purpose, location
- <component-N>: purpose, location

## How to operate
1. Memory first: search before answering from training data
2. Find existing workflows before starting a task: check <skills-folder>
3. Check existing tools before writing scripts: read <tools-manifest>
4. When stuck: explain what's missing. Don't guess.

## On-demand context (not always loaded)
| If the topic is... | Read... |
|---|---|
| Authentication | docs/auth.md |
| Database migrations | docs/db.md |
| Writing Go code | memory/coding-standards/go.md |
| Writing React code | memory/coding-standards/react.md |
| ... | ... |

Why it matters: the identity file is the one thing loaded on every turn. It sets the defaults for everything else. Get this right and the model behaves predictably; get it wrong and no amount of prompting saves you.

Layer 2: Rules

What it is: Short, auto-loaded safety and protocol files. Things the agent must never do, or must always do, regardless of the task.

What goes in it:

  • Destructive-action guardrails (rm -rf, force-push, drop tables; always confirm)
  • Security rules (never commit .env, never log credentials)
  • Team conventions that are inviolable (PR title format, branch-from-dev, never merge your own PRs)
  • Voice boundaries (when the agent writes as you vs about you, especially if the agent drafts outbound communication)
  • The few coding standards that are truly inviolable: no any in TypeScript, all SQL parameterised, no console.log in production code

What stays out: preferences. “We usually prefer…” is not a rule; it’s guidance, and it belongs in a skill or in memory/coding-standards/. Rules are the list of things that cause incidents when violated. The bulk of your coding standards (naming, file layout, idioms, patterns) are guidance, not rules: they live in memory and lazy-load when the agent is actually writing code in that area.

Template:

.claude/rules/
├── guardrails.md         # Destructive actions, security, data integrity
├── conventions.md        # PR format, branch rules, merge rules
├── voice-boundaries.md   # Internal tone vs external artifacts
└── knowledge-index.md    # Topic to file pointer map
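
And a sketch of the guardrails file itself; every line here is a placeholder to swap for your own red lines:

# guardrails.md (sketch)

## Destructive actions
- Never run `rm -rf`, `git push --force`, or any `DROP`/`TRUNCATE` statement without explicit confirmation.
- Never rewrite history on `main` or `dev`.

## Security
- Never commit `.env`, keys, or anything that looks like a credential.
- Never paste secrets into logs, PR bodies, or chat output.

## Data integrity
- Every migration ships with a rollback path before it runs.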

Why it matters: rules are cheap insurance. The day the agent is about to force-push to main, the rule file is what stops it. Treat this layer like your linter config: small, enforced, reviewed.

Layer 3: Skills

What it is: Self-contained workflows the agent can invoke by name. A skill is a directory with a SKILL.md (instructions) and optional scripts/ (deterministic helpers).

What goes in a skill:

  • When to use it (trigger conditions)
  • Step-by-step process
  • Scripts to call at specific steps
  • Output format

Good skills in a typical codebase:

  • ship-pr: branch from dev, run tests, draft PR body, assign reviewers
  • review-pr: pull diff, check against coding standards, post structured feedback
  • run-migration: generate migration, dry-run, review, apply, rollback plan
  • incident-response: gather logs, draft postmortem skeleton
  • new-endpoint: scaffold route + tests + docs + OpenAPI entry

Why it matters: skills turn tribal knowledge into callable functions. Instead of asking “how do we ship a PR here?”, a new engineer (or the agent on their behalf) invokes /ship-pr. The workflow is the documentation. It can’t drift because it’s executed, not described.

Template directory:

.claude/skills/
├── ship-pr/
│   ├── SKILL.md
│   └── scripts/run-tests.sh
├── review-pr/
│   ├── SKILL.md
│   └── scripts/post-comment.py
└── ...

Skills are the layer where cross-runtime portability is real. The SKILL.md format is now an open standard at agentskills.io: originally Anthropic’s, it has since been adopted by Cursor, Codex CLI, Windsurf, Goose, OpenCode, Gemini CLI, Junie, Factory, GitHub Copilot, and Claude Code. The cross-tool convention is .agents/skills/<name>/SKILL.md (note the plural). Together with the AGENTS.md identity standard, skills are one of the two layers where the same files run unmodified across runtimes. Rules and hooks remain runtime-specific.

Layer 4: Hooks

What it is: Deterministic scripts the agent runtime executes at specific lifecycle points. Pre-commit, post-tool-use, session-start, session-end.

What hooks are for: things that must happen, not things the agent might choose to do. The model decides; the hook enforces.

Common useful hooks:

  • SessionStart: inject up-to-date context (today’s date, current branch, active issues). Stops the model hallucinating what day it is.
  • PreToolUse on Bash: scan the command for dangerous patterns (rm -rf /, force-push, DROP TABLE). Block or require confirmation.
  • PostToolUse on Edit: run the formatter. You’ll never have another debate about tabs vs spaces.
  • SessionEnd: auto-save memory, log the session, archive artifacts.

Why it matters: agents are probabilistic. Hooks are not. The split between what the model reasons about and what scripts execute is the single most important architectural call in your harness. Your agents get more reliable the moment you stop asking the model to do deterministic things and start asking the hooks to do them.
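
Two of these hooks are small enough to sketch in full. A SessionStart hook can be a one-liner whose output the runtime injects into context (shown for Claude Code; the wiring lives in .claude/settings.json and varies by runtime):

#!/usr/bin/env bash
# .claude/hooks/session-start.sh (sketch)
# Stdout from a SessionStart hook is added to the session context in Claude Code.
echo "Today is $(date +%F). Current branch: $(git branch --show-current 2>/dev/null || echo unknown)."

And a PreToolUse guard for Bash, which reads the pending tool call from stdin and blocks anything matching a dangerous pattern (field names and the blocking exit code follow Claude Code's hook convention at the time of writing; jq is assumed to be installed):

#!/usr/bin/env bash
# .claude/hooks/pre-bash.sh (sketch)
cmd=$(jq -r '.tool_input.command // empty')   # the runtime pipes the tool-call JSON to stdin

if echo "$cmd" | grep -Eq 'rm -rf|git push (-f|--force)|DROP TABLE|TRUNCATE'; then
  echo "Blocked by guardrails: potentially destructive command: $cmd" >&2
  exit 2   # exit 2 blocks the call and feeds the reason back to the model
fi
exit 0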

Layer 5: Memory

What it is: Persistent knowledge that outlives a single session. Two kinds: structured and searchable.

Structured memory is a set of curated markdown files. Always-loaded ones (identity, team context, current work) live at the top of the pyramid. On-demand ones (per-project context, coding standards, research summaries, preferences) are referenced by the identity file and loaded when relevant.

Searchable memory is a vector store: embeddings over the structured files plus anything else worth indexing (chat history, decisions, past PRs). The agent searches semantically before answering: “have we discussed this? made this decision? hit this problem before?”

Template structure:

memory/
├── MEMORY.md             # Always loaded, curated top-level facts
├── ACTIVE.md             # Always loaded, current work status
├── coding-standards/     # Lazy-loaded, only when writing code in that area
│   ├── go.md
│   ├── react.md
│   ├── sql.md
│   └── README.md         # Index of which standard applies where
├── preferences/          # On-demand, team working styles
├── projects/             # On-demand, per-project context
└── research/             # On-demand, topic deep-dives
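
To make the top of the pyramid concrete, here is a sketch of MEMORY.md in the same placeholder style as the identity template earlier:

# MEMORY.md

## Team
- <who owns what, review expectations, on-call rotation>

## Standing facts
- <the handful of decisions everyone keeps re-asking: ORM, deployment target, versioning scheme>

## Decisions log (one line each, newest first)
- <date>: <decision>, details in memory/projects/<project>.md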

Coding standards belong in memory, not in rules. A team’s TypeScript conventions, React component patterns, SQL idioms, Go style notes can run to dozens of pages each. Loading all of it on every turn is wasteful and counterproductive: the agent answers a one-line refactor and arrives carrying your full style guide. Lazy-load instead. The identity file points at memory/coding-standards/ with a one-line index of which file matches which area, and the agent pulls the relevant one only when actually writing code there. The handful of coding standards that are inviolable (security-critical patterns, “no any in TypeScript”) get promoted to rules; everything else stays in memory.

Rules that keep memory useful:

  • Write on purpose, not by accident: memory doesn’t self-update. Wire a /save-memory skill, a /wrap-up skill, or a SessionEnd hook to push new facts back before the session closes (a sketch of the hook half follows this list).
  • Check before writing: deduplicate. Memory bloat is worse than no memory.
  • One authoritative location per fact: pointers from elsewhere, never copies.
  • Expire aggressively: if a fact is outdated, remove it. Stale memory is worse than no memory.
  • Human-curated top, machine-searchable bottom: the pyramid pattern.
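
The hook half of that first rule is small. A sketch, assuming the reference layout above; curating actual facts stays with a /save-memory skill the model runs, while this only logs the mechanical bits:

#!/usr/bin/env bash
# .claude/hooks/session-end.sh (sketch; paths match this article's reference layout)
# Appends a dated breadcrumb so the next session can see what just happened.
{
  echo ""
  echo "## Session $(date '+%F %H:%M')"
  echo "- branch: $(git branch --show-current 2>/dev/null || echo 'n/a')"
  echo "- uncommitted files: $(git status --porcelain 2>/dev/null | wc -l | tr -d ' ')"
} >> memory/ACTIVE.md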

Why it matters: without memory, every session starts cold. With good memory, the agent has the context of a colleague who’s been on the team for a year. This is the single biggest quality-of-life upgrade after the identity file.

Layer 6: Scheduled agents

What it is: cron jobs that run the agent without a human in the loop. Morning briefing, end-of-day summary, weekly review, incident watchdog, dependency audit.

What makes a scheduled agent worth writing:

  • Runs on a predictable cadence
  • Produces a single clear output (a report, a filed issue, an email, a dashboard update)
  • Has a well-defined stop condition (doesn’t drift into open-ended exploration)
  • Is cheap to rerun if it fails

Examples that earn their keep:

  • PR triage: every 4 hours, summarise new PRs, flag anything urgent
  • Dependency watch: weekly scan for CVEs, file an issue per finding
  • Flaky-test reporter: nightly, scan CI runs for newly flaky tests, file an issue per occurrence
  • Release-notes drafter: on every merge to main, draft release notes from PR titles and labels
  • Stale-branch sweep: weekly, list branches with no commits in 30+ days and notify their authors
  • Engineering weekly: Friday afternoon, summarise what shipped, what’s stuck, what’s queued for next week

What stays a chat agent: anything that needs judgment calls with human context. Don’t cron-schedule things where “it depends” is the right answer.

Why it matters: scheduled agents flip the relationship. Instead of opening a chat window and asking for help, the agent shows up at the right moment with the right information already prepared. The economics shift too. A scheduled agent running once a day at around $0.20 per run costs about $70 a year. A human doing the same triage, fifteen minutes a day, is easily $10k a year, and it’s not the kind of work anyone wants to do.
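
Mechanically, a scheduled agent can be as small as a shell script plus a crontab line. A sketch, assuming Claude Code's non-interactive print mode (claude -p) and placeholder paths:

#!/usr/bin/env bash
# cron/pr-triage/run.sh (sketch; prompt file, report path, and schedule are placeholders)
set -euo pipefail
cd "$(dirname "$0")/../.."                             # repo root, so the harness files load
mkdir -p cron/pr-triage/reports
claude -p "$(cat cron/pr-triage/PROMPT.md)" \
  > "cron/pr-triage/reports/$(date +%F).md"            # one clear output per run

# crontab entry: weekdays at 07:00, cheap to rerun by hand if it fails
# 0 7 * * 1-5  /path/to/your-repo/cron/pr-triage/run.sh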

The fully-automated end of this spectrum, “dark factory” agents that plan, implement, review, and merge without human intervention, deserves its own treatment. This article stops at scheduled reporters and watchdogs: humans still read the output and decide what to do with it.


Putting it together: the reference layout

Any runtime, any stack, roughly this shape in the repo:

[Figure: Reference harness directory layout; the six layers mapped onto repository paths, mirrored by the tree below.]

your-repo/
├── AGENTS.md                       # Layer 1: Identity (open standard)
├── CLAUDE.md                       # One-liner: @AGENTS.md (Claude Code reads this)
├── .claude/
│   ├── rules/                      # Layer 2: Rules
│   ├── skills/                     # Layer 3: Skills
│   ├── hooks/                      # Layer 4: Hook scripts
│   └── settings.json               # Wires hooks to lifecycle events
├── memory/                         # Layer 5: Memory
│   ├── MEMORY.md
│   ├── ACTIVE.md
│   ├── coding-standards/           # Lazy-loaded by topic
│   └── projects/
└── cron/                           # Layer 6: Scheduled agents
    ├── pr-triage/
    ├── dependency-watch/
    └── flaky-test-reporter/

Directory names vary by runtime. Claude Code reads CLAUDE.md and .claude/; Cursor reads AGENTS.md and .cursor/rules/; Codex CLI reads AGENTS.md and .agents/skills/ (note the plural). AGENTS.md is the cross-tool identity standard, stewarded by the Linux Foundation’s Agentic AI Foundation and supported natively by Codex, Cursor, Windsurf, Cline, Jules, Junie, and 30+ others. Aider supports it via opt-in (read: AGENTS.md in .aider.conf.yml). Claude Code reads CLAUDE.md instead, so the common pattern is a CLAUDE.md whose only line is @AGENTS.md. The companion Agent Skills standard makes SKILL.md portable across most of the same runtimes via .agents/skills/. Hooks and rules formats are still runtime-specific: same role, different files.


A worked example: the ship-pr skill end-to-end

Here’s what one of the simplest useful skills looks like in practice. Copy it, adapt it, make it your own.

File: .claude/skills/ship-pr/SKILL.md

# ship-pr: open a review-ready pull request

## When to use
User says "ship it", "open a PR", or work is complete on a feature branch.

## Steps
1. Verify branch is not `main`/`dev`. If it is, abort and ask.
2. Run `scripts/run-tests.sh`. If it fails, stop. Fix before continuing.
3. Summarise the diff vs `dev`: files touched, intent, risk areas.
4. Generate PR title using convention: `<TICKET-ID>: <imperative summary>`.
5. Generate PR body from the template below.
6. Open PR against `dev` via `gh pr create`. Never against `main`.
7. Assign reviewers per `CODEOWNERS`. Post the PR URL back to the user.

## PR body template
## What
<1 to 3 bullets on the change>

## Why
<ticket link plus one-sentence motivation>

## Test plan
<checklist>

## Rollback
<one line: revert commit / feature flag / migration reverse>

File: .claude/skills/ship-pr/scripts/run-tests.sh is ten lines of bash that runs your real test suite and exits non-zero on failure. The LLM doesn’t decide whether tests pass; the script does.
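
For completeness, a sketch of that script; the commands are placeholders for whatever your real checks are:

#!/usr/bin/env bash
# .claude/skills/ship-pr/scripts/run-tests.sh (sketch; swap in your real lint/test/build steps)
set -euo pipefail   # any failing step exits non-zero, which is all the skill needs to know
npm run lint
npm test
npm run build
echo "All checks passed."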

What the user experiences: they say “ship it.” Thirty seconds later there’s a PR open against dev, with a title that matches team convention, a body that follows the template, and the right reviewers assigned. If tests failed, the agent stopped and told them why.

What’s notable about this pattern:

  • The skill fits on one page
  • The LLM decides what to summarise and how to describe risk (reasoning)
  • The script decides whether tests passed (deterministic)
  • The rules file (Layer 2) is where “never PR against main” lives. The skill trusts it.
  • Six months later, a team convention changes. You edit one file, reviewed via PR, and the whole team gets the update.

This is the whole game in miniature. Every skill looks roughly like this.


Five failure modes I’ve walked into (so you don’t have to)

A few patterns that keep tripping me up, and that I’ve seen repeated elsewhere:

1. Treating the harness like config, not code.

If your rules and skills aren’t in version control with PR reviews, they’ll rot. One afternoon of inconsistent edits and nobody trusts them anymore. Treat the harness as first-class code. Lint it, review it, write tests against it (yes, you can: evals).
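
A minimal version of such an eval, assuming Claude Code's non-interactive -p mode; the question and expected answer are placeholders for one of your own conventions:

#!/usr/bin/env bash
# evals/harness-smoke-test.sh (sketch)
# Asks a question whose answer only exists in the harness; fails CI if the harness wasn't picked up.
set -euo pipefail
answer=$(claude -p "Which branch do we open pull requests against?")
echo "$answer" | grep -qi "dev" \
  || { echo "Harness regression: expected PRs to target dev, got: $answer" >&2; exit 1; }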

2. Putting everything in the always-loaded identity file.

The model has a context window, not infinite attention. A 3,000-line CLAUDE.md means the model arrives every morning already tired. Move rarely-needed context behind pointers and only load it when relevant. The pattern: identity file is a map; the territory is on-demand.

3. Mocking deterministic work into the LLM.

If you’re asking the model to “generate a valid JSON payload” or “format this as SQL” or “calculate the diff size,” you’re burning tokens on something a script does better for free. The LLM reasons about what to do; scripts execute what was decided. The split is the architecture.

4. No memory discipline.

Agents that save everything remember nothing. A memory store with 500 entries of overlapping, outdated, duplicate facts is worse than none. The model’s search returns noise. Curate ruthlessly. Dedupe on write. Expire on read if stale.

5. Starting with scheduled agents before the rest works.

Cron is the last layer for a reason. If your synchronous agent is flaky, your scheduled one will be flaky-while-you-sleep, which is worse. Get layers 1 to 5 solid for four to six weeks before you point cron at them.


Monday-morning checklist

If you’re starting from nothing, this is the order:

Week 1:

  1. Write a 100-line identity file (AGENTS.md, or CLAUDE.md if you’re on Claude Code): purpose, architecture, how to operate
  2. Add one rules file: destructive-action guardrails plus your three most-violated conventions
  3. Ship one skill for your most repeated workflow (most teams: ship-pr or review-pr)

Week 2:

  1. Add hooks for SessionStart (inject date + branch) and PreToolUse on bash (block dangerous commands)
  2. Create memory/ with MEMORY.md and ACTIVE.md. Start them nearly empty. Then wire an update mechanism: a /save-memory skill the agent runs after learning something worth keeping, a /wrap-up skill at session end, or a SessionEnd hook. Without one, the files stagnate.
  3. Write a second skill

Week 3:

  1. Add memory/coding-standards/ with one file per language or framework you use, plus a one-line index in the identity file pointing to each
  2. Add vector memory over your structured files (any of mem0, LanceDB, or Qdrant + a 50-line script)
  3. Write a third skill

Week 4:

  1. One scheduled agent. Pick the smallest useful one (PR triage is a good starter)
  2. Check in on yourself: how often does the agent get things right on the first try compared to four weeks ago? If it’s not meaningfully better, the problem is in the earlier layers. Fix those before adding more.

Models will keep getting better. Runtimes will keep getting more capable. The part I find myself investing most of my time in, and getting most of the returns from, is the harness I build on top.

If it helps to have something to look at, I’m publishing an open reference harness, agentic-harness, as a fork-friendly example of all six layers, deliberately stripped of any workflow opinions. Fork it, take what fits, ignore what doesn’t. That’s what it’s there for.

For the personal side of the same pattern (specialist personas, memory files, scheduled routines applied to a founder’s week instead of a team’s codebase) see A Week in My AI OS.


FAQ

Q: Does this only apply to Claude Code? The six-layer pattern is runtime-agnostic. Every runtime has an identity file, a way to load rules, somewhere to put procedural knowledge. The standardisation level varies by layer: identity (AGENTS.md) and skills (SKILL.md via agentskills.io) are now genuinely portable open standards. Rules and hooks still have runtime-specific formats: same role, different files. Claude Code is currently the most fully-realised harness ecosystem; the reference repo demonstrates that, with notes on how to translate each layer for Cursor, Codex CLI, Aider, Cline, and Windsurf.

Q: How much effort is this really? Weeks? Months? A functional v1 of all six layers is two to three weeks of part-time work for one engineer. A mature harness is a quarter. You won’t be “done.” Harnesses evolve with the codebase.

Q: Who on the team should own the harness? Treat it like tooling. Usually a senior engineer or a small platform-ish group. Everyone contributes rules and skills via PR; one or two people own coherence.

Q: Can I retrofit this into an existing codebase with a runtime already in use? Yes, and you should. Start with the identity file and one rules file. You’ll get immediate lift before you’ve even written a skill.

Q: What about secrets and proprietary knowledge? Same rules as any repo. Secrets in .env, never committed. Proprietary context (client names, internal architecture) in a private harness; generic patterns in a public one if you want to share. A useful split is to keep the structure (rules, skills, hooks) public and the memory (project context, decisions, client-specific facts) private. Many teams already do this with their own code, and the same pattern works here.

Q: Do I need a vector store for memory? Not on day one. Structured markdown plus grep gets you surprisingly far. Add a vector store when you notice yourself (or the agent) failing to find things that clearly exist.

Q: How do I know my harness is working? Two signals. First: the rate at which the agent gets things right on the first try goes up. Second: new team members ramp faster because the harness documents the team’s real conventions, not the aspirational ones in the wiki.