ArchitectureJun 16, 2026

Is the harness now a product?

Duane Grey

AI Strategy & Implementation

Find this useful? Share the visual.

The Short Version

When a model writes the code, the engineering job moves up a level, from typing the code to building the system that checks it. That system is the harness, and how well its checking layer is built decides whether "the loop did it" also means "the loop did it correctly." Before you take your hands off, name what your checks actually catch, because a loop that can build even the next model is only as trustworthy as the net around it.

Close to it. When a model writes the code, the engineering job moves up a level, from typing the code to building the system that checks it. That system is the harness, and how well it is built decides whether "the loop did it" also means "the loop did it correctly."

Boris Cherny, who created Claude Code, described his own version of this in late 2025. The line got quoted everywhere. "I don't prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops." He has also said coding is the easy part, with the hard work being infrastructure, listening to users, and getting feedback. He reported going a month without opening an editor while the model wrote the code across 259 pull requests.

The reaction split into two camps, and both walked past the answer in between. One heard "coding is solved" and started pushing unreviewed code through autonomous loops with the controls turned off. The other pushed back hard, pointing at the bugs that still ship and the developers getting burned out as proof that nothing is solved. The first is selling a slogan with the engineering deleted. The second is right that people are getting hurt and wrong about why.

Neither side names the layer the work moved to.

The four layers, and where the effort lives now

The way people get useful output from these models has climbed a ladder over about four years, and the engineering effort climbed with it.

Prompts came first, roughly 2022 through 2024. The question was what to say to the model, and the early wins came from better phrasing. That ran into a wall. Perfect wording still fails when the model cannot see the information it needs.

Context engineering was the next rung, through 2025. The question became what information to put in front of the model. Karpathy described it as filling the context window with the right information for the next step. The work moved from phrasing to deciding what the model gets to see.

The harness is where the effort sits now. A clean definition going around is that the harness is everything in the agent except the model. It is the orchestration, the error recovery, the guardrails, and the part that checks the work.

The loop is the harness running. The agent acts, looks at the result, reasons about whether it is closer to done, and decides whether to keep going. A goal is what you point the loop at, a finish line plus some boundaries, with the agent free to pick its own route there.

Harness, loop, and goal describe three parts of one thing. The harness is the machine, the loop is it running, the goal is the target. When Boris says his job is writing loops, he means his work climbed to the top of that ladder. The slogan strips out the three rungs underneath that make the top one safe to stand on.

What the checking layer catches

The center of a harness is how it checks the agent's work. Those checks come in two kinds that are not equal.

The first kind is deterministic, the compilers and type checkers and linters and tests that hand back a binary verdict with nothing to negotiate. The code compiles or it does not. The suite passes or it fails. A machine renders the call.

The second kind is inferential. A separate model reads the output and judges it, usually against a written rubric. People call this using a model as a judge, where one grades another instead of a compiler doing it. It can evaluate things a test cannot phrase, like whether a change matches what you asked for. It is also softer, since the right wording can slip past a judgment call.

Deterministic checks catch whether the code compiles, types, and passes the tests you wrote. What they miss is code that runs cleanly, clears the suite, and is still wrong against what you meant. A function can be green and incorrect at the same time. Tests cover the cases you thought to write, and no more.

That difference, between "it passed" and "it is right," is the gap between what you meant and the approximation that satisfied the test.

Speed versus prove it correct

The argument people frame as speed versus correctness is an argument about that difference.

The speed camp bets that tests plus a model judging the output are a good enough net to take your hands off and run many agents at once. The verification camp, the engineers pushing formal proofs a machine can check, bets that tests plus a judge miss the errors that matter most, the ones about meaning rather than mechanics, and that you need something stronger before you trust the result.

Both can be right depending on the work. A marketing site and an avionics module do not need the same net. "Coding is solved" holds to the degree your checks can prove it. The slogan does damage because it does not say how to build the net, or that the net changes with what you are building, and assumes you already know this.

What getting a loop wrong costs

The advice to just run loops tends to come from people whose tokens are effectively free. For everyone else, the loop has a price, and getting it wrong is where that price shows up. One pointed at a fuzzy finish line burns money on its way to the wrong answer. Before that comes the cost of learning to build one for your particular project, which the slogans skip. Past both sits the skill the whole thing rests on. You have to turn what you pictured into instructions exact enough that the checks can hold the agent to them. That part is closer to writing than to configuration. It is the same gap named earlier, between what you meant and the approximation that satisfied the test.

When the loop builds the next model

The riskiest loop is the recursive one, where a system is used to build its own successor.

In May 2026, Andrej Karpathy joined Anthropic to lead a team using Claude to speed up the training of the next Claude. He had already shown the idea worked. He wired an agent to a small model, let it run unsupervised for two days, and it found optimizations he then applied to a larger model to cut training time. Anthropic reports that more than 80 percent of its merged code is now written by the model, with engineers shipping several times more code than a year ago. Those are the company's own numbers, from people who sell the tool, so weigh them accordingly. For agentic development, the direction matters more than the numbers. The more it writes, including the next version of itself, the more your results depend on the loop wrapped around it.

When the loop writes a feature, a weak checking layer costs you a bug. When it shapes the next model, the evals it has to clear decide whether the new version is genuinely better or just better at gaming them. A loop that builds its own successor is as trustworthy as the checks around it, and no more. The verification layer, easy to treat as plumbing, just became the one with the most riding on it.

The question to ask before you take your hands off

For an engineering lead, or anyone accountable for what ships, the useful move is to look at the net you already trust and name what it catches, instead of picking a side in the speed argument.

When an agent, or a fleet of them, opens a pull request, what stands between that code and production? If the answer is the tests pass and a model reviewed it, then an inferential checker is guarding correctness. For code that touches money, customer data, or anything a security review would flag, that is thin protection.

The teams getting real value out of loops built a net they have read, and they kept a human on the actions the net cannot vouch for. The agent count is not the critical differentiator. Speed is fine once you can say what your checks cover. The risk is handing the work to a net you have not read.

Written by Duane Grey

AI Strategy & Implementation

Independent AI consultant helping companies cut through hype and deploy systems that produce real results.

Follow on LinkedIn Follow on YouTube

Related Insights

Architecture · May 16, 2026What is the right architecture when the AI brain itself cannot be trusted?Security · Apr 28, 2026What happens when an AI monitor has a relationship with the agent it is watching?Architecture · May 13, 2026Why did the reasoning model lose the planning test?