The Plateau Nobody Will Name

There is a question being asked, in private and not yet in print, by a substantial number of the people whose business it is to watch frontier artificial-intelligence models. The question is whether the steady six-monthly cadence of capability improvement, the cadence that held with very little interruption from the appearance of GPT-3.5 in November 2022, has begun to slow in the past twelve to fifteen months. The question is not whether progress has stopped; almost no one of consequence believes that. The question is whether the rate of progress is what it was eighteen months ago, and whether the next eighteen months will look like the last.

This is a more difficult question than it appears, and a great deal more difficult than the easy answers in either direction suggest. The purpose of the present essay is to take it seriously — to set down what one would have to believe for the plateau case to be correct, what one would have to believe for the continued-cadence case to be correct, and to leave the reader in the calibrated uncertainty the evidence supports. This is an addendum to the trajectory series of earlier this month: the trajectory pieces took the cadence largely as given and reasoned from it. The question of whether the cadence is itself still operating is the question I did not ask there.

The Cadence That Held

The thing whose continuation is at issue is real enough that one should be plain about what it was. From late 2022 through to roughly the end of 2024, the major frontier laboratories (OpenAI, Anthropic, Google DeepMind, and a small handful of others) shipped, with remarkable regularity, a new frontier-class model every six months or so. The cadence was not a coordinated industry timetable; it was, rather, the convergent outcome of compute scaling laws, training-run durations, and the commercial pressures of a field in which falling behind by a single release cycle was understood to be serious. GPT-3.5 in November 2022. GPT-4 in March 2023. Claude 2 in July 2023. Gemini 1.0 in December 2023. Claude 3 in March 2024. GPT-4o in May 2024. Claude 3.5 Sonnet in June 2024. Gemini 1.5 Pro in the same span. The intervals were six months, give or take a month.

What made the cadence legible — what made it, in fact, the organising fact of the field — was that each release was, on the benchmarks then in use, visibly better than the one before. GPT-4 was a generational advance over GPT-3.5, not a refinement of it. Claude 3.5 Sonnet of June 2024 was visibly better than Claude 3 Opus of three months earlier at a great many things — coding, instruction-following, reasoning chains, multimodal handling. The numbers on MMLU climbed; the numbers on HumanEval climbed; the numbers on GSM8K climbed; the field had, in that period, the great convenience of being able to point at a number and say the number went up. One could write an account of frontier-model progress that any general reader could follow, because the progress was happening on axes the general reader could be made to care about.

By the end of 2024, an attentive observer could plot the trend lines and project them into 2025 and 2026 with a fair degree of confidence. The projection was that the cadence would continue, the benchmarks would continue to be passed, and the field’s collective story about itself — capability is doubling on a six-monthly clock — would continue to hold. The trajectory pieces of earlier this month took this projection largely as their working assumption, and so did most of the analyst class.

The Cadence That May or May Not Be Holding

Set against the legible 2022-to-2024 sequence, the eighteen months that have followed it have a curiously different character. The releases have continued. Claude 3.7 Sonnet in February 2025; Claude 4 in May 2025; Claude Sonnet 4.5 in September and Opus 4.5 in November of the same year; the 4.6 update of February 2026; the GPT-5 announcement of August 2025; Gemini 2.5 in the spring of 2025 and Gemini 3 in the late autumn; and, on the open-weight side, an entire second generation of substantial models from Meta, DeepSeek, Mistral, and the Chinese state-aligned labs. By release count, the cadence has not merely held; it has intensified. There are more frontier-class models being shipped per quarter in 2026 than there were in any quarter of the prior three years.

What has changed is the legibility of the improvement between releases. Claude 3.5 to Claude 3.7 was a tangible upgrade on a great many user-visible axes; Claude 3.7 to Claude 4 was tangible but narrower; Claude 4 to Claude 4.5 was, on most benchmarks then current, a single-digit improvement that the headline numbers struggled to dramatise; Claude 4.5 to Claude 4.6 was an improvement that required several pages of evaluation notes to characterise, and that most users, asked to compare the two side by side on their everyday work, were obliged to call a near-tie. The parallel sequence at OpenAI (GPT-4o, the o1 reasoning models, o3 and o4-mini, GPT-5, GPT-5.1) produced, by mid-2026, an almost identical effect: the model names had advanced through a full version, and the headline numbers had advanced by a percentage that no general reader would describe as the same kind of advance the 3.5-to-4 step had been.

There are two readings. The first, which one hears in the safety-and-policy conversation and which the press will not name aloud, is that the rate of underlying capability gain has begun to slow: that we are in the early phase of an S-curve flattening, that the pretraining-scaling regime which produced the 2022-to-2024 advances has entered diminishing returns, and that the field is living off the residual gains from a different and slower set of techniques. The second, which one hears in the research conversation and which the press will also not name aloud (for the opposite reason), is that the rate of underlying capability gain has not slowed at all: that the cadence is intact, that the gains are being cashed out along axes the benchmarks are not measuring, and that the appearance of slowness is an artefact of the measurement infrastructure rather than a property of the underlying capability.

Both readings have respectable evidence behind them. This is the difficulty.

The Ceiling That Spoils the Picture

The most obvious complication, and the one that makes the simple plateau argument considerably harder to defend than it appears, is benchmark saturation. The principal benchmarks of the leaderboard era — MMLU, HellaSwag, GSM8K, HumanEval, ARC, the BIG-Bench suite — have, every one of them, been pushed by the frontier models to within a percentage point or two of the human-expert ceiling. The benchmark cannot, by construction, signal a gain past its ceiling; a model that scores 95 percent on MMLU and a model that scores 94 percent are operationally indistinguishable, even if one of them would, on a harder version of the test, demonstrate a substantial advantage. The agentic benchmarks that succeeded the leaderboards — SWE-bench Verified, OSWorld, GAIA, the τ-bench family — have, as I traced in the eval post, followed the same arc on a substantially faster clock. SWE-bench Verified, introduced in August 2024 and saturating by mid-2026, ran the full ceiling-pinning cycle in under two years.
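
The ceiling effect can be made concrete with a toy model. What follows is a minimal sketch, assuming a one-parameter logistic (Rasch-style) relation between a model's latent ability and the chance of answering an item correctly. The ability values and item difficulties are hypothetical, chosen only to show the shape of the effect, and correspond to no real model or benchmark.

```python
import math

def expected_score(ability: float, difficulties: list[float]) -> float:
    """Mean probability of a correct answer under a one-parameter logistic model."""
    probs = [1.0 / (1.0 + math.exp(-(ability - d))) for d in difficulties]
    return sum(probs) / len(probs)

# Hypothetical latent abilities and item difficulties on an arbitrary logit scale.
model_a, model_b = 4.0, 5.5
easy_test = [0.0, 0.5, 1.0, 1.5]   # stands in for a saturated benchmark
hard_test = [4.0, 4.5, 5.0, 5.5]   # stands in for a harder held-out benchmark

# The same ability gap, measured on two different instruments.
gap_easy = expected_score(model_b, easy_test) - expected_score(model_a, easy_test)
gap_hard = expected_score(model_b, hard_test) - expected_score(model_a, hard_test)

print(f"gap on the easy test: {gap_easy:.3f}")
print(f"gap on the hard test: {gap_hard:.3f}")
```

Under these assumptions the same difference in latent ability shows up as a few points on the saturated test and as roughly a third of the scale on the harder one. The point is about the instrument, not about any particular pair of models.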

The effect of ceiling-pinning on the press narrative is not subtle. A regime in which the headline number is the number that goes up depends, structurally, on a number that has room left to go up. When the number reaches the ceiling, the regime stops producing its characteristic signal, and an observer accustomed to the signal experiences the absence of signal as a slowdown — even if the underlying capability is in fact continuing to advance along axes the benchmark in question never measured. A great deal of the public conversation about a 2025-2026 plateau is, on inspection, conversation about benchmarks that have run out of room rather than conversation about capabilities that have run out of slope.

The field has responded by moving to agentic and then to economic evaluation; by designing harder held-out benchmarks (Humanity’s Last Exam, FrontierMath, GPQA-Diamond); and by constructing evaluations whose ceiling is the limit of trained-domain-expert performance rather than crowd-worker baselines. This is the right response in principle, but it has the practical effect of moving the legibility of progress out of the general reader’s reach. A six-point gain on FrontierMath does not have the immediate intelligibility of a six-point gain on MMLU; the public has no folk sense of how hard the FrontierMath problems are, and one cannot easily build one without doing the actual mathematics. The field has traded a legible saturating signal for an illegible non-saturating one, and the cost of that trade is most of the public conversation’s ability to track what is happening.

The Tax That Capability Pays

A second complication, quieter, is the way capability gains have been cashed out over the past year. The 2022-to-2024 cadence had a particular character: each new frontier model was both noticeably more capable and noticeably more expensive to run, and the trade was generally judged worth it. From the middle of 2024 onwards, an increasing fraction of the engineering work in the major labs has been the work of holding capability roughly constant while pushing the cost down — sometimes by an order of magnitude per generation. The absolute floor — the per-token cost of a competent open-weight model on commodity hardware — has fallen by perhaps a factor of fifty since the start of the period. Latency has followed a similar curve: the median time-to-first-token on a frontier reasoning model in early 2025 was several seconds; the median in mid-2026, for an equivalent prompt, is well under one.

What this means for the plateau argument is that a substantial part of the engineering yield of the past year has been invisible as a capability gain because it was paid out as a cost gain instead. If the same model that cost ten dollars per million tokens last year now costs one dollar, runs at four times the speed, and is available in a context window thirty times larger, this is a perfectly real piece of progress — it has materially changed what the model can be deployed against — and it does not move any of the capability benchmarks. The latency-and-price story is a story the labs do tell themselves; it is not a story the press finds dramatic, nor one a worried policy analyst would describe as continuing capability progress. The infrastructure tax is, in this sense, a tax on legibility as much as it is a tax on capability.
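
The arithmetic of that hypothetical is worth setting out. The sketch below takes only the ratios given above (a tenfold price cut, a fourfold speed-up, a thirtyfold context window); the absolute figures, the benchmark score, and the `Release` record itself are illustrative assumptions, not data about any real model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Release:
    benchmark_score: float          # headline capability score, in percent
    usd_per_million_tokens: float   # price
    tokens_per_second: float        # generation speed
    context_tokens: int             # context window size

# Hypothetical absolute values; only the ratios between them come from the text.
last_year = Release(90.0, 10.0, 50.0, 100_000)
this_year = Release(90.0, 1.0, 200.0, 3_000_000)

# What a capability leaderboard sees.
score_delta = this_year.benchmark_score - last_year.benchmark_score

# What a deployment budget sees.
cost_ratio = last_year.usd_per_million_tokens / this_year.usd_per_million_tokens
speed_ratio = this_year.tokens_per_second / last_year.tokens_per_second
context_ratio = this_year.context_tokens / last_year.context_tokens

print(f"benchmark change: {score_delta:+.1f} points")
print(f"tokens per dollar: {cost_ratio:.0f}x, speed: {speed_ratio:.0f}x, context: {context_ratio:.0f}x")
```

A leaderboard reads the first line and records no change; anyone budgeting a deployment reads the second and sees a different product.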

The Different Slope

The third complication is the one that, in the research conversation, has the most weight, and which the public conversation has the hardest time absorbing. The dominant axis of capability gain in the frontier labs through 2022 to early 2024 was pretraining scale: more parameters, more tokens, more compute, a larger and cleaner training corpus. The scaling laws that governed this regime were unusually well-characterised, and the cadence of capability improvement was, in essence, the cadence at which the labs could secure compute and data to feed the scaling laws. From roughly mid-2024 onwards, the dominant axis has shifted. The marginal capability gain from a further doubling of pretraining compute, at the frontier, has begun to diminish — not vanish, but diminish — while the marginal capability gain from reinforcement learning at scale, particularly RL on reasoning traces and tool-use trajectories, has not.

The agentic post-training regime — the work that produced o1 and its successors at OpenAI, the extended-thinking modes at Anthropic, the deep-search and tool-orchestration capabilities at Google — is a different rate-of-progress regime from the pretraining one. Its gains accrue along axes the pretraining benchmarks were not built to measure: long-horizon task coherence, multi-step tool use, the ability to maintain a coherent plan across an interaction that runs for hours rather than minutes. These are the capabilities one would expect to see show up in agentic benchmarks; they are showing up there, in fact, faster than the pretraining-era benchmarks improved through 2022-2024. SWE-bench Verified climbed from approximately 49 percent in late 2024 to over 80 percent by late 2025 — a single-year movement of more than thirty percentage points, on a benchmark that was supposed to be hard.
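
One way to compare that movement with a late-stage gain on a saturated benchmark is to look at the share of remaining failures eliminated, rather than at raw percentage points. This is a minimal sketch of that lens, which is an assumption of the example and not a metric used elsewhere in this essay. The inputs are the rounded figures already quoted: 49 and 80 percent for SWE-bench Verified, and the 94 and 95 percent MMLU pair used earlier.

```python
def point_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points."""
    return after - before

def relative_error_reduction(before: float, after: float) -> float:
    """Fraction of the remaining failures (100 - before) that was eliminated."""
    return (after - before) / (100.0 - before)

# Rounded scores quoted in the text, in percent.
swe_points = point_gain(49.0, 80.0)
swe_errors = relative_error_reduction(49.0, 80.0)
mmlu_points = point_gain(94.0, 95.0)
mmlu_errors = relative_error_reduction(94.0, 95.0)

print(f"SWE-bench Verified: {swe_points:.0f} points, {swe_errors:.0%} of failures removed")
print(f"MMLU example:       {mmlu_points:.0f} point,  {mmlu_errors:.0%} of failures removed")
```

On this view the agentic benchmark moved a long way by either measure, while the one-point step near the ceiling removes about a sixth of the remaining failures, which is more than the headline point suggests and still less than the agentic movement.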

What an honest reading must concede is that if the dominant axis of capability gain has shifted from pretraining scale to RL and test-time compute, then the appropriate comparison is not whether the new axis is producing as much benchmark-legible gain per six months as the old one was. The appropriate comparison is whether the deployed capability of the frontier models is continuing to advance along the axes the new regime targets. By that measure (long-horizon coding agents, multi-hour autonomous research tasks, the operational integration of vision and language and tool use), the deployed capability is advancing at a rate that does not appear to have slowed. It may, in fact, be advancing faster than it was in 2023, on the axes that now matter for the work the models are being put to. The plateau, on this reading, is not a plateau at all; it is the surface effect of a measurement regime that has not caught up to where the work moved.

The Things That Did Improve

It is worth saying plainly what, in the past twelve months, did improve in a way one could not honestly call a plateau. Long-context coherence — the ability of a frontier model to hold a million tokens in working memory and to maintain consistent reasoning across them — went from a research curiosity in early 2025 to a deployed default by the end of the same year. Agentic coding — the ability of a model to read a real codebase, navigate it, propose and execute changes, run the tests, fix the failures, and iterate to a working state without supervision — went from a brittle laboratory demonstration to the actual working pattern of a substantial fraction of professional software engineering. Multimodal reasoning — the integration of vision, audio, and structured-document processing into the same generation loop — went from a feature the labs advertised to a feature one stopped noticing was there.

None of these is the headline capability gain of a benchmark score; all of them are, taken together, the substance of what one would have meant in 2023 by ‘the models got dramatically better’. The honest reading of the period is that they did. The dishonest reading is that they did not, and the dishonest reading is the one a great many people are reaching for, because it accords with a satisfying narrative about the limits of large language models and because it is rhetorically convenient for a press cycle that has tired of the previous one.

What ‘Slowing’ Would Even Look Like

There is a methodological problem at the bottom of all of this that deserves naming directly. The field has no outside reference frame against which to measure its own rate of progress. There is no constant comparison group (no team of human experts being measured on the same evolving benchmark in the same way over the same period) that would allow one to say ‘the gap between the model and the reference closed by X percentage points per quarter in 2024 and by Y percentage points per quarter in 2026’. What one has instead is a sequence of benchmarks each of which lives for eighteen to twenty-four months before saturating, and a press cycle that mistakes the saturation of any one benchmark for the saturation of the underlying capability.
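
For concreteness, this is the calculation such a reference frame would permit if it existed. The sketch is purely illustrative: the reference score, the quarterly model scores, and the function name are all hypothetical, and nothing here is a measurement of any real system.

```python
def gap_closure_per_quarter(reference: float, model_scores: list[float]) -> list[float]:
    """Points by which the gap to a fixed reference shrank in each successive quarter."""
    gaps = [reference - score for score in model_scores]
    return [earlier - later for earlier, later in zip(gaps, gaps[1:])]

def mean(values: list[float]) -> float:
    return sum(values) / len(values)

# Hypothetical: a constant expert reference group and five quarterly model scores per year.
reference_score = 90.0
scores_2024 = [40.0, 48.0, 56.0, 64.0, 72.0]
scores_2026 = [80.0, 83.0, 86.0, 88.0, 89.0]

closure_2024 = gap_closure_per_quarter(reference_score, scores_2024)
closure_2026 = gap_closure_per_quarter(reference_score, scores_2026)

x = mean(closure_2024)  # the "X" of the sentence above
y = mean(closure_2026)  # the "Y" of the sentence above
print(f"X = {x:.2f} points per quarter, Y = {y:.2f} points per quarter")
```

Even in this invented case the second series is approaching the reference, so the smaller Y is entangled with the ceiling problem described earlier. The instrument would need a reference that stays well clear of the models for the comparison to mean what it appears to mean.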

If the cadence really has slowed, one would expect particular signals: a flattening of the agentic benchmark curves before saturation; a divergence between the labs’ internal evaluations and their public ones; a slowdown in the rate at which new capabilities cross from research demonstration to product deployment; an increase in the share of new releases that are predominantly cost-and-latency improvements. Some of these signals are present in the 2025-2026 record, and some are not. The agentic curves are still rising at substantial rates, though the rates are no longer the rates of 2024. The internal-versus-public evaluation gap, where it can be observed, is widening modestly. The research-to-deployment lag has, if anything, shortened. The cost-and-latency share of the engineering yield has unambiguously risen.

If the cadence really has not slowed, one would expect signals as well: continued movement on the new harder benchmarks at rates comparable to the old benchmarks’ rates at the equivalent point in their lives; the appearance of qualitatively new capabilities at the same six-monthly intervals; the labs continuing to be willing to bet large compute commitments on training runs whose payoff is uncertain. These signals are mixed. The new-benchmark movement is fast but the comparison is hard. The qualitatively-new-capability question is the one I find hardest: long-horizon agentic coding is, on most defensible readings, a qualitatively new capability that arrived in late 2024; nothing of the same character has arrived in the eighteen months since, though several things of nearly-the-same character have. The compute commitments have continued, and have in fact accelerated.

The picture that emerges from reading the signals against each other is not the picture either side of the conversation wants. It is the picture of a field whose rate of legible benchmark progress has slowed for understandable structural reasons (saturation, measurement-regime change), whose rate of deployed-capability progress has not visibly slowed and may have continued unchanged, and whose underlying technical regime has shifted axes in a way that makes the question ‘has progress slowed?’ genuinely ill-defined for the period in question. One cannot, in fairness, give either a clean yes or a clean no.

The Provisional Reading

Three things, on my reading of the record, seem to me worth setting down with the modest confidence the evidence actually supports.

The first is that the legible cadence has slowed. This is not in serious dispute. The benchmarks one could point at in 2023 have ceased to discriminate; the model-version increments are smaller as fractions of headline scores than they were; the press has lost the ability to write a clean ‘Model X is N percent better than Model Y’ story without footnotes. Whether this is the legibility slowing or the underlying capability slowing is exactly the question one cannot resolve from the legibility itself.

The second is that the deployed cadence — the rate at which the working capability of the frontier models advances on the axes one would actually deploy them on — has not visibly slowed, and may, on the agentic axes, have accelerated. The shape of what one can do with a frontier model in May 2026, compared with what one could do in May 2025, is a meaningfully larger difference than the headline benchmark scores suggest, and a meaningfully different kind of difference than the 2023-to-2024 comparison was. Long horizons; autonomous tool use; reliable multi-step reasoning chains; useful integration of multimodal input; cost and latency that have changed deployment economics by an order of magnitude — these are real, they are recent, and they are not consistent with the claim that the field has plateaued.

The third is the one I should like to leave plainly, because it is the one a careful reading actually supports. We do not, at present, have measurement infrastructure of sufficient discrimination to answer the question whether the rate of frontier capability gain has slowed in 2025-2026. We have indications. We have signals that lean, on balance and on my reading, slightly toward ‘the underlying rate is intact but the axis has shifted’. We have no instrument that would let one make that statement with the confidence the question is being argued with on either side. The honest position, holding the record steadily, is that the question is open, that one has provisional reasons to lean one way rather than the other, and that anyone offering a confident answer in either direction is offering something the evidence does not support.

The plateau nobody will name, taken at face value, may exist; it may not; and the absence from the conversation of anyone willing to call it either way is, I suspect, a more reliable indicator than any of the claims being made about it. A field that knew its own rate of progress would not have this much trouble describing it. That fact, more than any benchmark, is the one I find most worth holding in mind.