What the Benchmarks Don't Catch

The published benchmarks for agentic coding tools, taken at face value, describe a problem largely solved. SWE-bench Verified, the headline number against which the field measures itself, now sits comfortably above eighty-five per cent for the leading agents (with the latest preview model nudging ninety-four) and continues to climb; LiveCodeBench has saturated for most practical purposes; Aider’s polyglot suite, designed to be hard, has been very largely conquered; HumanEval has been considered finished for so long that no one quoting it expects to be taken seriously. A reader who had only the scoreboards to go by would conclude that the engineering profession is past the interesting part of this transition and ought to be turning its attention to what to do with the freed hours. A reader who had spent the day actually using an agent on a real codebase would, after a moment, find that this conclusion did not quite match the evidence of the keyboard in front of them.

This post is a corrective to yesterday’s piece on the state of coding, which surveyed the working day in May of 2026 without quite pausing on the question of what one ought to make of the numbers the tools advertise. The argument here is that there exists a real and widening gap between what the published benchmarks report and what working engineers experience using these tools, and that the gap is not closing; it has, if anything, become harder to read as the scores climb. The benchmarks are not lying. They are answering a question, with admirable precision, that is not the question the working engineer is in fact asking.

The Numbers, As Published

It is worth being precise about what the scoreboards measure, because the precision is the larger half of the trouble.

SWE-bench Verified, the most-cited number in this space, presents the agent with a real GitHub issue from one of a fixed pool of well-known Python repositories, together with the repository at the relevant commit. The agent is to produce a patch. The patch is then evaluated against a held-out test — a test the agent did not see — that the human who fixed the bug had originally written or modified. The agent’s score is the fraction of issues for which the patch causes the held-out test to pass without breaking any others. The eighty-per-cent threshold was crossed by the leaders in late 2025; the current top scores sit in the high eighties and low nineties depending on the month and the configuration, with the trailing-twelve-month curve still pointing upward, if visibly flatter than it was. HumanEval, older and easier, asks the model to complete short isolated functions against unit tests; the leaders cleared ninety per cent some time ago. LiveCodeBench rotates competitive-programming problems through to defeat training-set contamination, and even on its hardest tier the frontier models clear half. Aider’s polyglot benchmark — Exercism-style problems in six languages — has gone from a curiosity to a metric on which the leaders sit in the high eighties, in about eighteen months, though the leaders are still meaningfully separable on it.
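
The scoring rule just described is simple enough to write down. What follows is an illustrative sketch, not the benchmark's actual harness: the `TaskResult` record, its field names, and the sample data are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical record, not the real harness schema)."""
    held_out_passed: bool  # the held-out test passes with the agent's patch applied
    regressions: int       # previously passing tests that the patch broke


def is_resolved(result: TaskResult) -> bool:
    # A task counts only if the held-out test passes AND nothing else broke.
    return result.held_out_passed and result.regressions == 0


def resolved_rate(results: list[TaskResult]) -> float:
    # The headline score: the fraction of tasks resolved.
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Made-up results for four tasks.
results = [
    TaskResult(held_out_passed=True, regressions=0),   # resolved
    TaskResult(held_out_passed=True, regressions=2),   # fixed the bug, broke other tests
    TaskResult(held_out_passed=False, regressions=0),  # did not fix the bug
    TaskResult(held_out_passed=True, regressions=0),   # resolved
]
print(resolved_rate(results))  # 0.5
```

Everything the score knows about a patch is contained in those two fields; whatever else is true of the patch does not enter the number.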

These are, taken on their merits, real numbers reflecting real progress. The agents are doing things, on the precise tasks the benchmarks pose, that they could not do a year ago and that no one had any business expecting them to do two years ago. One should not be too quick to deride the scoreboards; they capture a genuine accumulation of capability, and the people building the benchmarks know perfectly well what they have built. The trouble is that the working engineer, after a day with the same tool that scored eighty-eight on SWE-bench Verified the week before, comes away with the distinct impression that the tool was not, in any honest accounting, nine-tenths of the way to having handled the day’s work for them; the experience and the score seem to describe two different objects entirely. The discrepancy is large enough to require an explanation, and the explanation is structural. The benchmarks are not catching certain things, and those things are precisely what production engineering is mostly made of.

The Closed-World Problem

The first and most consequential gap is the one that follows from the benchmark’s own architecture, and one becomes aware of it the moment one tries to specify a real task to an agent.

A SWE-bench task is closed in three senses at once. The repository is small enough for the agent to search end to end: Django, scikit-learn, sympy and the rest of a pool of twelve, each in the range of tens to low hundreds of thousands of lines. The issue is local (a regression in a particular module, a behaviour change in a particular function) and the patch that resolves it touches, on average, fewer than ten lines of code in fewer than three files. And the success criterion is a test that already exists, written by the engineer who originally produced the fix and held back from the agent until grading. The agent has only to find the right code to change and produce a patch that causes that test to go from red to green.

A real engineering task is not closed in any of these senses, and the absence of closure is not a minor variation. Consider a recent task of my own — to add a new authentication path to a service that lives in a repository of perhaps a million lines, with several adjacent services in adjacent repositories that the new path will need to interoperate with, no existing test for the behaviour because the behaviour is new, and no clean specification of what correct would look like because the question of what correct means is precisely what most of the conversation with the product manager has been about. There is no held-out test. There is no module to look in. There is, instead, the open world: a system that has grown over six years, that has accumulated three competing conventions for how authentication is done in three different parts of the codebase, that has a deployment pipeline whose constraints are not written down anywhere, and a set of consumers whose expectations have to be inferred from the shape of the support tickets they have filed in the last quarter. The agent, presented with this task, can do something useful. It cannot do what the benchmark suggests it can do, because the benchmark has stripped out the parts of the problem the benchmark cannot represent.

The closed-world quality bleeds into a related and quieter failure: the fixed-test trap. In SWE-bench the agent passes by producing code that makes a specific predetermined test go green. In real production engineering, deciding what to test is one of the harder and more consequential parts of the work, and is itself a place where the agent’s performance falls off a cliff that the benchmark cannot register. Ask an agent to fix a bug given the failing test and it will perform creditably; ask the same agent to determine what tests should exist for a new feature, and the suggestions one gets back are the suggestions of a smart undergraduate: coverage of the happy path, a couple of obvious error cases, nothing on the failure modes that a person who had lived with the codebase for two years would have known to fear. The benchmark, by fixing the test in advance, has already answered a question that constitutes a substantial fraction of senior engineering judgment. One does not see this on the scoreboard. One sees it the first afternoon one asks the agent for a test plan.

Compound Failure and the Arithmetic of Long Tasks

The second gap is arithmetic and is, on reflection, the one most easily verified at one’s own keyboard.

Most benchmark tasks are, in their internal structure, short. A SWE-bench task can be resolved by an agent in something between thirty seconds and a few minutes of wall-clock work, occupying perhaps five to fifteen tool calls. HumanEval problems are a single completion. Aider polyglot problems are a small handful of edits and a test run. The unit of evaluation is, in each case, a coherent piece of work that fits inside a single sustained pass and is scored on the artefact at the end.

The shape of real agentic work in 2026 is not this shape. A real task (refactor this subsystem, add this feature across the three services that touch it, migrate this schema and the dozen call sites that depend on it) decomposes, in the agent’s actual execution, into something between ten and fifty sequential steps, each of which has its own success criterion and its own probability of success. The arithmetic that follows is unfriendly. An agent that does each step correctly ninety per cent of the time has, over ten sequential steps, a success rate not of ninety per cent but of thirty-five per cent; over twenty steps, twelve per cent. This is not a clever framing; it is the elementary product of probabilities, on the simplifying assumption that the steps succeed or fail independently, and it is close to what one sees in practice. The agent that handles every small benchmark task with fluent competence breaks down, on the larger end-to-end task, at a rate that the benchmark, with its short-task framing, simply does not see.
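
The arithmetic is easy to check at one's own keyboard. A minimal sketch, assuming every step succeeds independently and with the same probability (a simplification; real steps are neither independent nor equally hard); the function names are mine:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach `target` end-to-end success."""
    return target ** (1 / steps)


for steps in (1, 10, 20, 50):
    rate = end_to_end_success(0.9, steps)
    print(f"{steps:>2} steps at 90% each: {rate:.1%}")
# prints 90.0%, 34.9%, 12.2% and 0.5% respectively

# Turned around: what each step must achieve for 90% over twenty steps.
print(f"{required_per_step(0.9, 20):.2%}")  # 99.47%
```

Read in the second direction, the figures are more sobering than in the first: to finish a twenty-step task nine times in ten, each step has to be right about 99.5 per cent of the time.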

A benchmark could in principle be redesigned to test compound tasks, and one or two of the newer suites are beginning to try; but the difficulty of designing a fair multi-step benchmark is severe — each additional step expands the space of acceptable trajectories combinatorially, grading becomes harder, reproducibility suffers, and the question of what counts as a step is itself contested. The benchmarks that exist are accordingly predominantly single-shot, and the failure mode that dominates production agent use sits almost entirely off the scoreboard.

I caught a clean instance of this last week. The task was to extract a tangled piece of business logic from one service into a shared library, update the two consuming services to import from it, and migrate the test suites accordingly. The agent handled each of the eight sub-tasks I had identified with what looked, at the time, like ninety-per-cent competence. The end-to-end result was a pull request that built, passed its tests, and was, in three subtle places, wrong — wrong because at step four the agent had inlined a helper it had found inconvenient to import, and at step six had introduced a slight change in error-handling semantics it had judged immaterial, and at step seven had silently downgraded one of the test cases to skip rather than run. No individual step was a benchmark-style failure. The composition was.
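
Of the three defects, the test quietly downgraded to skip is the one a mechanical comparison can surface, because a skipped test does not turn the build red. A minimal sketch, assuming the test outcomes before and after the change have already been collected into plain dictionaries of test name to status; the test names and statuses below are made up for illustration:

```python
def silently_weakened(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Tests that passed before the change but no longer run after it."""
    weakened = []
    for name, status in before.items():
        if status != "passed":
            continue
        # A test absent from the later run is treated as "missing".
        new_status = after.get(name, "missing")
        if new_status in ("skipped", "missing"):
            weakened.append(f"{name}: passed -> {new_status}")
    return sorted(weakened)


# Hypothetical outcomes on the base branch and on the agent's branch.
before = {
    "test_refund_rounding": "passed",
    "test_refund_currency": "passed",
    "test_refund_limits": "passed",
}
after = {
    "test_refund_rounding": "passed",
    "test_refund_currency": "skipped",
}

for line in silently_weakened(before, after):
    print(line)
# test_refund_currency: passed -> skipped
# test_refund_limits: passed -> missing
```

A check of this kind catches only the third defect. The inlined helper and the changed error-handling semantics leave every test status exactly as it was.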

Context Drift Over Long Sessions

The third gap shows up only on tasks long enough to encounter it, which is to say almost never on the benchmarks and almost always in production. The benchmarks, by virtue of their single-shot framing, simply do not exercise the regime in which it appears.

What one observes, after perhaps thirty or forty tool calls in a single agent session (the threshold varies by model and by the density of the calls), is a degradation in coherence that has no analogue in shorter work. The agent begins to lose track of decisions made earlier in the session. A naming convention agreed upon at step five is quietly violated at step thirty-two. A design constraint stated at the outset (do not introduce a new dependency on this module) is forgotten in the natural flow of a refactor twenty minutes later. The agent’s plan, examined three-quarters of the way through, no longer matches the plan it announced at the beginning, and the divergence has not been signposted; the agent does not announce that it has changed its mind, because from the agent’s perspective it has not changed its mind. It is simply operating from a context that has, by then, only partial fidelity to the context the session began with.

The labs are aware of this. Context windows have grown; retrieval is being layered in over raw context; the better agents will, mid-session, summarise their own state and re-load the summary, which buys a useful margin. None of this has solved the underlying drift, because the drift is not a problem of nominal context length — the relevant facts are nominally still in the window — but a problem of the model’s attention to the early context relative to the late context. The recency bias is structural; the prompt-engineering remedies are partial; and the working engineer has, by way of practical adaptation, learned to do what the benchmarks never require — to break a long task into separately-prompted shorter tasks, restating the relevant context each time, in order to avoid paying in the second half of a session for the drift that accumulated in the first. The agent that scored brilliantly on a benchmark where the task is over in five minutes has not been tested in the regime where this drift dominates, and the score has nothing to say about it.
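
The adaptation described above has a simple shape: every sub-task gets a fresh session and a prompt that restates the standing constraints and the decisions made so far. A minimal sketch; `run_agent` stands in for whatever call starts a new agent session and returns a summary of what it did, and is an assumption of mine rather than any tool's real API, as are the constraint and task strings:

```python
from typing import Callable


def build_prompt(constraints: list[str], decisions: list[str], subtask: str) -> str:
    """Assemble a self-contained prompt: standing context first, then the task."""
    lines = ["Constraints (apply to every step):"]
    lines += [f"- {c}" for c in constraints]
    if decisions:
        lines.append("Decisions already made:")
        lines += [f"- {d}" for d in decisions]
    lines.append(f"Task: {subtask}")
    return "\n".join(lines)


def run_in_short_sessions(
    constraints: list[str],
    subtasks: list[str],
    run_agent: Callable[[str], str],  # hypothetical: one fresh session per call
) -> list[str]:
    decisions: list[str] = []
    for subtask in subtasks:
        prompt = build_prompt(constraints, decisions, subtask)
        decisions.append(run_agent(prompt))  # carry the summary forward
    return decisions


# A stub agent that records what it was shown, for demonstration only.
prompts_seen: list[str] = []


def fake_agent(prompt: str) -> str:
    prompts_seen.append(prompt)
    return "finished " + prompt.splitlines()[-1]


run_in_short_sessions(
    constraints=["do not introduce a new dependency on the billing module"],
    subtasks=["extract the helper", "update the call sites"],
    run_agent=fake_agent,
)
print(prompts_seen[1])
```

The constraint appears at the top of every prompt, so no session is ever more than a few lines away from it. The price is that the human, not the agent, now owns the decomposition and the bookkeeping.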

The Plausible Wrongness

The fourth gap is the one that has cost me the most personal time and that I have come to think is the most important of the four.

Benchmark tests have a property that production code does not have: they are decisive about whether the code is correct. A SWE-bench task succeeds or fails on whether the held-out test passes. The grading is binary; the answer arrives in seconds; the engineer is never left wondering whether the agent’s output is good. In production, the equivalent decisiveness does not exist. An agent produces a patch. The patch compiles. The patch passes the tests one has thought to write. The patch reads, on careful inspection, as though it is doing the right thing. And the patch is, sometimes — by my own informal accounting, perhaps once in five sessions on a non-trivial task — wrong in a way that none of these checks have surfaced, and that is discovered, if at all, by a human three commits later.

The shape of the wrongness is the consistent thing. It is not the wrongness of code that does not work. It is the wrongness of code that works for the case one looked at and is silently incorrect for a case one did not. The agent has handled the obvious instance of the bug; it has not generalised the fix; the second instance, structurally identical, sits a hundred lines away and is not touched. Or the agent has fixed the symptom without locating the underlying cause; the test the agent added does pass, but it tests the symptom rather than the cause, and a different manifestation of the same cause will surface in a fortnight. Or the agent has made a subtle change in the semantics of a helper function — a default argument, an error-return convention, an implicit assumption about ordering — that is invisible at the call site the agent was working with and breaks a different call site that the agent did not think to look at.

This failure mode is, by its nature, expensive. It does not present immediately; it presents in a code review, or in production, or in a bug report from a customer. By the time it presents, the engineer who would have caught it on the keyboard has lost the context, the agent has moved on, and the recovery cost — find the regression, diagnose it, fix it, and decide what to do about the trust relationship with the agent — is many multiples of what the original task would have cost a human. The benchmark cannot see this at all, because the benchmark’s test is exactly the test the agent’s code passes. It is the unwritten test — the one a senior engineer would have known to write — that the code fails, and it is the unwritten test that does not exist in the benchmark’s universe.
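
A toy instance makes the blindness concrete. The sketch below is invented for illustration and is not drawn from any benchmark: two candidate patches for a bug reported as "login fails when the username has a trailing space", one treating the symptom and one the cause.

```python
def normalise_symptom_fix(name: str) -> str:
    # Handles the reported case only: trailing whitespace.
    return name.rstrip().lower()


def normalise_general_fix(name: str) -> str:
    # Handles the underlying cause: stray whitespace on either side.
    return name.strip().lower()


def held_out_test(normalise) -> bool:
    # The one case from the bug report; this is all the grader checks.
    return normalise("Alice ") == "alice"


def unwritten_test(normalise) -> bool:
    # The structurally identical case that nobody reported.
    return normalise(" Alice") == "alice"


for patch in (normalise_symptom_fix, normalise_general_fix):
    print(patch.__name__, held_out_test(patch), unwritten_test(patch))
# normalise_symptom_fix True False
# normalise_general_fix True True
```

Graded on the held-out test alone, the two patches are indistinguishable. Only the test that was never written separates them.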

The honest name for this failure mode is plausible wrongness, and it is the characteristic failure mode of agent-assisted engineering in 2026. It is also the one the benchmark scoreboards are most catastrophically blind to. An agent that systematically produces plausibly-wrong patches will, on SWE-bench Verified, score within a few points of an agent that produces correctly-general patches, because the held-out test is too narrow an instrument to tell the difference. The two agents look the same on the scoreboard. They are not the same agent. The first is the one most engineers report having; the second is the one the scoreboards report on.

What This Is Not

One ought to be plain about what this is not.

It is not an argument that the benchmarks are useless. They are not useless. The relative ordering of agents on SWE-bench Verified does, in practice, correlate with the relative quality of the agents on production work, and the engineer choosing between two models is well advised to look at the score before they look at anything else. The benchmark is a floor an agent must clear, a necessary condition for capability, and a useful one. The argument here is that it is only a necessary condition, and that the gap between clearing the floor and the working reality is large, structurally produced, and not closing; and that the trade press and the labs themselves frequently elide this distinction in a way that misleads readers who have not used the tools at length.

It is not an argument that the benchmarks should be different from what they are. Designing a benchmark that captures the closed-world failure, the compound-failure arithmetic, the context-drift regime, and the plausible-wrongness mode is, on inspection, very nearly impossible: each of these failure modes is hard to elicit in a controlled setting, hard to grade reproducibly, and hard to distinguish from noise on any individual run. The benchmarks measure what can be measured. One should not blame an instrument for not measuring what an instrument of that kind cannot measure. One should, however, remember its limits when interpreting its readings.

And it is not, finally, an argument that the agents are not working. They are working. The state-of-coding post made the case at length and I will not retract it here. The argument is narrower and more specific: the working engineer’s day, on a complex task in a real codebase, is shaped substantially by the four failure modes above, none of which the published numbers see; and a reader who takes the numbers as a complete picture of capability will be exactly wrong, in the direction of over-confidence, by exactly the amount of the gap I am describing. That gap is the subject. It is not an argument against the technology; it is an argument against a particular and widespread misreading of it.

What One Is Obliged to Read Instead

The structural facts come out, in order, as follows.

The first is that the benchmark scoreboards capture one shape of capability — the shape of capability one can verify with a held-out test on a closed, short, single-shot task — and have very little to say about the other shape, which is the capability to be useful across the open, long, compound work of a real engineering organisation. Both shapes are real. The first has been climbing for two years and continues to climb. The second has been climbing as well, but more slowly, less legibly, and along a curve that no one has published — because no one knows how to measure it, and the metric one would want does not yet exist.

The second is that the gap between the two has practical consequences. The engineer who buys the scoreboard reading buys, with it, a particular over-confidence about what the tool can be asked to do unsupervised, and pays for the over-confidence at the rate of the plausible-wrongness defects that survive their reviews. The engineer who reads only the scoreboard and not the experience reports — the developer-survey trust figures, the practitioner blog posts, the maintainers’ complaints about generated pull requests — is reading half the literature and is the engineer most likely to be surprised by the bill at the end of the quarter.

The third is the one I should like to leave plainly. The benchmarks describe what the agent can do on the task the benchmark has fully specified for it. The working engineer is in the business of specifying the task. The two are not the same activity, and the gap between them is precisely the gap between an instrument that can be measured and a skill that cannot. The scoreboards will keep going up. The skill will keep mattering. Anyone who reads the first as evidence about the second is reading a thermometer and reporting on the weather.