We instrumented 90 'fix until green' agent loops. Here's what they waste.

Boris Cherny, the creator of Claude Code, was asked about his setup at a talk last week, and one thing he said caught my attention as someone focused entirely on loops:

"I sort of feel like loops are the future at this point."

He wasn’t being abstract — he described running dozens of them, cron jobs that each spawn an agent session, check the result, and go again: a few hundred agents during the day, a few thousand overnight. The pattern has a name now, loop engineering, and the simplest version is the Ralph loop: while tests fail, run the agent again.

What’s interesting is that these loops are verify-revise loops. That’s the exact structure LoopGain already instruments inside agent workflows, except the iterations are a hundred to a thousand times more expensive: one iteration is an entire agent session (an outer loop), with all its tool calls and file edits, instead of one model call (an inner loop). Addy Osmani’s “Loop Engineering” essay even flags it in passing — you “absolutely have to be careful about token costs” — but as an aside, not a headline, and as far as I can tell nobody in the conversation is actually measuring it.

So we measured it. We ran 90 real fix-until-green agent loops — every iteration a real headless agent session spending real API dollars — with LoopGain watching the failing-test count from the outside. Two things came out of it: the convergence math holds cleanly at session scale, and our own stop rule has an honest bug we’re now fixing. We also caught ourselves nearly publishing a third “finding” that turned out to be an artifact of how we ran the test — that one’s near the end, because the mistake is the most useful part.

TL;DR

The convergence math is coherent at session scale. Across all 90 loops, the classifier produced zero incoherent readings — it never once labeled an improving loop as diverging or oscillating, even though each session does completely different work in different files. This is the core claim, and it held again across a separate 36-loop cross-model run.
When sessions have budget, the governed stop is clean. In the well-budgeted cell, 30 of 30 loops converged and stopped at exactly the first zero-error session. No false stops.
We found a real bug in our own stop rule. On a deliberately hard, budget-tight cell, the loop false-stopped 13 of 30 times — and 9 of those would have reached zero if we’d let them run. It’s a hardcoded rule in our core, and the fix is shipping.
The savings are real but smaller than the hype. Against a bare Ralph loop with no completion check, a governed stop cuts ~78% of spend. Against a smarter loop that already stops on success, the honest number is ~19% — the money spent grinding on loops that were never going to converge. Don’t let anyone, us included, quote the big number without the baseline.
None of this is Claude-specific. A matched run with a GPT worker reproduced the coherence and the same near-breakthrough false stops. The differences between models were capability, not budget.

What we actually ran

Each trial is a small scratch repo with seeded bugs and a pytest suite that acts as the spec — 27 to 51 tests, with anywhere from 7 to 21 failing at the start. The loop works the way a real overnight loop does:

1. Spawn a real headless agent session (Haiku 4.5 behind a minimal driver that makes one model call per turn) with a bounded turn budget and one instruction: run the tests, fix the code, never touch the tests.
2. When the session ends, the harness, not the worker, restores the test files from git and runs pytest itself. The failing count is the error signal. The worker never grades its own work.
3. LoopGain.observe(failing_count) gets the number, and each iteration is committed to git, so “roll back to the best iteration” means an actual SHA you can check out.
4. Repeat for up to 10 sessions (the outer-loop cap an uninstrumented loop would hit) so we can compare what a governed stop would have done against what the full run actually cost.
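The verification step above can be sketched roughly as follows. This is a minimal illustration, not the harness from this experiment: the tests/ directory, the tests_ref argument (a git ref you trust to hold untampered tests), the 120-second timeout, and both function names are assumptions made for the example. It relies on pytest's documented exit codes (0 for all passed, 1 for test failures) and on the counts in its final summary line.

```python
import re
import subprocess
import sys


def parse_failing_count(pytest_output: str) -> int:
    """Add up 'N failed' and 'N error(s)' from pytest's final summary line."""
    lines = [ln for ln in pytest_output.splitlines() if ln.strip()]
    summary = lines[-1] if lines else ""
    failed = re.search(r"(\d+) failed", summary)
    errors = re.search(r"(\d+) errors?\b", summary)
    return sum(int(m.group(1)) for m in (failed, errors) if m)


def verify(repo: str, tests_ref: str, timeout_s: int = 120) -> int:
    """Restore the tests from a trusted ref, run pytest, return the failing count."""
    # Undo any edits the worker made to tracked test files (assumed to live in
    # tests/). Note: this does not delete new files the worker may have added.
    subprocess.run(
        ["git", "checkout", tests_ref, "--", "tests/"], cwd=repo, check=True
    )
    # The timeout matters: if the code under test hangs, subprocess.run raises
    # subprocess.TimeoutExpired instead of blocking the whole loop.
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", "tests/"],
        cwd=repo, capture_output=True, text=True, timeout=timeout_s,
    )
    if result.returncode == 0:   # every test passed
        return 0
    if result.returncode == 1:   # tests ran and some failed
        return parse_failing_count(result.stdout)
    # Any other exit code means pytest itself did not complete a normal run.
    raise RuntimeError(f"pytest exited with code {result.returncode}")
```

Refusing to return a number on any other exit code is deliberate: a crashed test run with no summary line should never be read as zero failures.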

One deliberate choice up front: the worker runs on the raw API, one inference per turn, not inside a production agent CLI. A benchmark has to hold the driver constant at the simplest, most transparent layer, or “a turn” stops meaning the same thing from one run to the next. (That choice turned out to matter a lot — see the near-miss at the end.) The whole experiment — 90 loops, 271 agent sessions — cost $6.50. The nice thing about measuring waste with cheap workers is that the percentages carry; the absolute dollars stay small until your loops don’t.

We ran three kinds of loops on purpose: an easy cell where sessions had plenty of budget (10 turns), a cell with genuinely hard tail bugs designed to make loops plateau (8 turns), and a deliberately budget-tight cell (6 turns) where a session often can’t finish a fix before its budget runs out.

Where it works: the math doesn’t care that sessions are messy

My honest worry going in was that session-scale iterations would be too heterogeneous for the convergence math. Inside a workflow, iteration N and iteration N+1 are revisions of the same answer. Out here, session 3 might fix two bugs in dates.py while session 4 rewrites a CSV parser. Would the error series even mean anything?

It did, and more cleanly than I expected. Across all 90 loops there wasn’t a single coherence violation: no monotonically improving loop ever got labeled DIVERGING or OSCILLATING. A typical good trajectory looked like 14 failing tests, then 2, 1, 0 — a continue verdict at every reading on the way down, and a stop at exactly the first zero-error session. Because every iteration is a real git commit, “stop and keep the best one” returns an actual SHA, not whatever state the last session happened to leave behind.
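For readers who want to run the same check on their own logs, here is a rough sketch of what counting coherence violations could look like. It assumes "monotonically improving" means the failing count never goes up, and that you have one label per reading. DIVERGING and OSCILLATING are the labels named above; the function name and the "CONVERGING" placeholder in the usage example are ours for illustration, not LoopGain's API.

```python
ALARM_LABELS = {"DIVERGING", "OSCILLATING"}


def coherence_violations(readings, labels):
    """Indices where a never-worsening trajectory was given an alarm label.

    readings: failing-test counts, one per session.
    labels:   the classifier's label for each reading (same length).
    """
    never_worsens = all(cur <= prev for prev, cur in zip(readings, readings[1:]))
    if not never_worsens:
        # The check only applies to trajectories that never get worse.
        return []
    return [i for i, label in enumerate(labels) if label in ALARM_LABELS]


# The typical good trajectory from the text, with placeholder label names.
print(coherence_violations([14, 2, 1, 0], ["CONVERGING"] * 4))
```

An empty list for every improving trajectory is what "zero coherence violations" means under these assumptions.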

In the well-budgeted cell, 30 of 30 loops converged and the governed stop fired at the right moment every time — zero false stops. If your loop’s sessions are big enough to make visible progress each time, the governed stop is close to free money. The interesting part is what happens when they aren’t.

Where it’s still wrong: the stop rule cuts near-breakthroughs

The budget-tight cell is the one I’d want every loop engineer to see, because it’s the cell most overnight loops actually live in.

When a session is too small to finish a fix, the work becomes bursty. A session reads the failing tests, edits half a fix, and runs out of budget. The next session finishes it, and the count finally moves. One real loop from this cell went 20, 20, 20, 20, 7, 0 — stuck for four sessions, then solved over the next two. LoopGain’s shipped stop rule killed it after the second flat reading — three sessions before the fix landed.

That’s the bug, and it’s ours. In the budget-tight cell the loop false-stopped 13 of 30 times, and 9 of those killed a run that would have reached zero failing tests. When we went looking for the cause we found it in our own core: the loop terminates after two consecutive STALLING readings, and that consecutive count is hardcoded. We do ship a configurable knob (stall_patience), but it governs when a reading first turns STALLING — not how many consecutive stalls end the loop. The count that actually pulls the trigger had no knob at all.

Here’s the part I find genuinely interesting, because it’s a property of session scale itself, not of our code. The session-boundary error signal is a coarse, decimated sample of the real fix trajectory. Each session already squeezes out everything it can before it stops, so you only get to read the error at session boundaries — and on a hard remainder, two sessions in a row can both fail to crack the same bug and report the identical count. Flat. In this cell, 54% of sessions left the failing count unchanged. At inner-loop scale — one model call per iteration — the signal has fine resolution and moves almost every step, so a no-movement reading is rare and therefore informative. At session scale, no-movement is the expected texture, not a danger sign. Same monitor, opposite statistics: the inner loop’s real risk is a bad generation spiking the error up (OSCILLATING), while the outer loop’s is a coarse flat stretch reading as a stall.
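If you want the equivalent of that flat-session percentage for your own loops, a minimal sketch looks like this. It assumes each trajectory is a list of failing counts at session boundaries and compares each reading to the one before it, so the first reading only serves as a baseline. That is a simplification we chose for the example and may not match exactly how the figure above was counted.

```python
def flat_fraction(trajectories):
    """Share of session-to-session steps where the failing count did not change."""
    flat = 0
    total = 0
    for traj in trajectories:
        for prev, cur in zip(traj, traj[1:]):
            total += 1
            if cur == prev:
                flat += 1
    return flat / total if total else 0.0


# The bursty loop quoted earlier: three of its five steps are flat.
print(flat_fraction([[20, 20, 20, 20, 7, 0]]))
```

A high value here is a hint that a small consecutive-stall threshold will fire on ordinary texture rather than on genuinely stuck loops.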

The good news is the fix is cheap and we measured it. We replayed every loop under stall thresholds from 2 to 5 consecutive readings. Raising the count from 2 to 5 cut false stops across all 90 loops from 14 to 7 while catching every genuine non-converging grind — the true catches stayed flat at 5 across the whole sweep. At session scale, a single flat reading is just weak evidence; you have to see several in a row before “stuck” is the right call.
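The replay idea is simple enough to sketch. What follows is a simplified stand-in, not LoopGain's classifier or our replay scripts: it treats a reading as stalled whenever it fails to beat the best count seen so far, uses the first reading as a baseline, and stops after stall_count consecutive stalled readings. The function names are ours. The first two trajectories are the ones quoted in this post; the third is a made-up non-converging grind added purely for illustration.

```python
def governed_stop(readings, stall_count, target=0):
    """Return (index, reason) for the reading at which the loop would stop."""
    best = readings[0]
    stalls = 0
    for i, count in enumerate(readings):
        if count <= target:
            return i, "converged"
        if i == 0:
            continue  # the first reading only sets the baseline
        if count < best:
            best, stalls = count, 0      # progress resets the stall counter
        else:
            stalls += 1
            if stalls >= stall_count:
                return i, "stalled"
    return len(readings) - 1, "cap"


def sweep(trajectories, counts=range(2, 6), target=0):
    """For each threshold, count (false_stops, true_catches) over full trajectories."""
    results = {}
    for k in counts:
        false_stops = true_catches = 0
        for traj in trajectories:
            i, reason = governed_stop(traj, k, target)
            if reason != "stalled":
                continue
            # A stall stop is false if the full run later reached the target.
            if any(c <= target for c in traj[i + 1:]):
                false_stops += 1
            else:
                true_catches += 1
        results[k] = (false_stops, true_catches)
    return results


trajectories = [
    [20, 20, 20, 20, 7, 0],   # flat-then-breakthrough loop quoted above
    [14, 2, 1, 0],            # typical good trajectory quoted above
    [9] * 10,                 # illustrative grind that never converges
]
print(sweep(trajectories))
```

On these three trajectories the sketch stops the breakthrough loop early at thresholds 2 and 3 and lets it finish at 4 and 5, while the grind is caught at every threshold. The real sweep ran over recorded trajectories with the real classifier, so treat this only as the shape of the computation.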

What an ungoverned loop actually wastes

So what does the governed stop save? It depends entirely on what you compare it to, and the honest answer is two numbers, not one:

Stop rule	vs a bare cap-10 loop (no completion check)	vs an until-green loop (stops on success)
LoopGain, shipped behavior	saves 78.3%	saves 19.4%
LoopGain, patient variant (count 5)	saves 75.4%	saves 8.5%

The first column is the real Ralph loop a lot of people actually run: while :; do agent; done, with no success check, grinding to a fixed cap. Against that, a governed stop saves about four-fifths, because most loops converge in a few sessions and the rest is pure waste. But that’s a low bar: any early stop clears it. The honest column is the second one, a loop that already quits on success. There, LoopGain’s marginal value is just the spend on loops that were never going to converge, and on this bench that’s ~19%. Real, worth having, but not the headline number you’ll see quoted. The patient variant saves even less against the smart baseline, which makes sense: it gives up some savings in exchange for killing fewer recoverable runs. That trade is the whole design question, and it’s why the stall count is becoming a knob.
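To make the two baselines concrete, here is a sketch of the arithmetic under stated assumptions: every session costs the same, the bare loop always runs to the cap, and the until-green loop stops at the first zero-failure session or at the cap. The LoopRecord type, its field names, and the three example loops are hypothetical, chosen only to show how the same governed stop produces two very different percentages.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoopRecord:
    first_green: Optional[int]  # 1-based session where failures hit zero; None if never
    governed_stop: int          # 1-based session where the governed loop stopped


def savings(records, cap=10):
    """Fraction of sessions saved by the governed stop against each baseline."""
    governed = sum(r.governed_stop for r in records)
    bare_cap = cap * len(records)
    until_green = sum(
        r.first_green if r.first_green is not None else cap for r in records
    )
    return {
        "vs_bare_cap": 1 - governed / bare_cap,
        "vs_until_green": 1 - governed / until_green,
    }


# Hypothetical loops: two converge, one never does and is stopped at session 3.
records = [
    LoopRecord(first_green=4, governed_stop=4),
    LoopRecord(first_green=3, governed_stop=3),
    LoopRecord(first_green=None, governed_stop=3),
]
print(savings(records))
```

Note what this accounting leaves out: a false stop shows up as a saving here, even though it throws away a fix. That is why the savings column and the false-stop count have to be read together.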

One more thing about those percentages, because it’s where session scale bites. Per iteration, the outer loop is the expensive one: a single agent session (a whole run of model calls and tool use) costs a hundred to a thousand times what one inner-loop model call does. Inner loops fire far more often, so their savings add up and are just as real; the wrinkle out here is that one wrong stop throws away a whole expensive session at once. Ours were cheap ($0.05–$0.10 on mini-tier models and tiny repos), but a production overnight loop on a frontier model and a real codebase runs $0.50–$5 a session, hundreds a night. Multiply the percentages above by your own session cost and loop count, and call the result an extrapolation, because that’s what it is.
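As a worked example of that multiplication, here is a tiny sketch. The 1,000 sessions per night is a hypothetical input, and carrying the 19.4% over from mini-tier workers to a frontier model is itself an assumption, which is exactly why the result is an extrapolation.

```python
def nightly_saving(sessions_per_night, cost_per_session, saving_fraction):
    """Extrapolated dollars saved per night; assumes the measured fraction carries over."""
    return sessions_per_night * cost_per_session * saving_fraction


SAVING_VS_UNTIL_GREEN = 0.194   # shipped behavior vs an until-green loop, from the table
SESSIONS = 1000                 # hypothetical overnight volume

low = nightly_saving(SESSIONS, 0.50, SAVING_VS_UNTIL_GREEN)
high = nightly_saving(SESSIONS, 5.00, SAVING_VS_UNTIL_GREEN)
print(f"${low:.0f} to ${high:.0f} per night")
```

Under those inputs the range comes out at roughly $97 to $970 a night. Swap in your own session count and cost before quoting anything.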

How we almost fooled ourselves

Here’s the near-miss I promised, because it changed how we’ll run every benchmark after this.

We first ran this study on the Claude Code CLI (claude -p) as the worker, and the budget-tight cell looked dramatic: 24 of 30 false stops, almost the entire cell read as a starved grind the monitor kept cutting. It would have made a punchier post. It was also wrong.

A claude -p “turn” is not one model call — the CLI spends turns orienting, planning tool use, managing its own context. So “6 turns” under the CLI is a very different amount of work than 6 raw-API inferences, and at that budget the CLI was starving the worker: 262 of 267 sessions hit the turn cap without finishing, which manufactured exactly the long flat trajectories that trip the stall rule. When we re-ran the identical cells, seeds, and budgets on the raw API — one inference per turn — the same cell mostly converged, and false stops dropped from 24/30 to the 13/30 you read above. The 24/30 measured claude -p’s turn semantics, not anything about loops.

The lesson cost us a finding, but it’s worth more than the finding was: a benchmark has to hold the driver constant at the simplest layer. How your specific production harness behaves is a real and useful question. It’s just a different question, and mixing the two is how you end up publishing your toolchain’s quirks as laws of nature.

It’s not a Claude thing

LoopGain is deliberately model-agnostic, so we ran a matched cross-model comparison: Haiku 4.5 and gpt-5-mini through the same minimal raw-API driver, same two tools, one inference per turn, identical budgets, same repos and bugs. 36 loops, $1.58.

The monitor findings replicated: zero coherence violations in all 36 loops, a clean 12-for-12 in the well-budgeted cells, genuine stalls correctly caught on both models, and the same near-breakthrough false stops as the main run — one GPT loop sat flat at 16 failing tests for several sessions, then cleared them, after the shipped rule had already killed it. The early-kill problem is a property of the stop rule, not of any model.

What differs between the models is capability, not budget. At identical budgets, Haiku converged 14 of 18 loops to gpt-5-mini’s 10, and on the hard plateau modules Haiku kept attempting edits where GPT more often stopped trying. Budget behavior, though, was near-identical: at six turns both models reached a file edit on every fresh session, and the flat, no-edit sessions only showed up late, after a loop had plateaued — at almost the same rate for both (69% each). So if you move a loop between models, expect capability differences; don’t expect to need different turn budgets. And the prices: the matched cells cost $1.26 (Haiku) vs $0.32 (gpt-5-mini) for identical work — worker prices vary by multiples at the same nominal tier, which is its own argument for measuring instead of guessing.

What we’re changing because of this

This experiment was the validation gate for supporting outer loops at all, and it both passed and embarrassed us in the same week, which I think is what a good benchmark is supposed to do.

The consecutive-stall count is now a real, configurable parameter (stall_terminate_count, shipped in v0.6.0). Previously it was hardcoded, and only the stall onset was tunable. The default stays 2, which is right for inner, per-call loops, but session-scale loops can now raise it. The data points to ~5, and we’re running a larger study to pin the session-scale default with a confidence interval before we change the default itself.
The integration docs get the budget rule, stated driver-relative. Turn budgets only mean something for a specific harness, so measure your own floor instead of copying anyone’s number — ours included.
Benchmarks run on the raw API, one inference per turn. Production-harness behavior gets its own, separately labeled tests. (See above for why we learned this the hard way.)
One of our trial workers wrote an infinite loop into the code under test and hung the evaluation — our harness had a timeout, and yours should too. If your error signal can hang, your loop can hang.
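One way to "measure your own floor" is sketched below. It rests on an assumption about what the floor is: the smallest turn budget at which fresh sessions reach their first file edit, which is the criterion the cross-model comparison above reports at six turns. The function name, the coverage parameter, and the idea of logging the turn of the first edit per session are ours, not part of LoopGain.

```python
import math


def turn_floor(first_edit_turns, coverage=1.0):
    """Smallest turn budget at which `coverage` of fresh sessions reached a file edit.

    first_edit_turns: for each probe session, the 1-based turn of its first
    file edit, or None if it never edited within the budget it was given.
    Returns None when too few sessions reached an edit at all.
    """
    n = len(first_edit_turns)
    if n == 0:
        raise ValueError("need at least one probe session")
    reached = sorted(t for t in first_edit_turns if t is not None)
    needed = max(1, math.ceil(coverage * n))
    if len(reached) < needed:
        return None  # probe budget was too small; rerun with a larger one
    return reached[needed - 1]


# Hypothetical probe: four fresh sessions on your own harness.
print(turn_floor([3, 4, 6, 5]))
```

Run the probe on the same harness you use in production, since a "turn" under a CLI and a turn on the raw API are not the same unit.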

The honest limits

These are scratch repos with seeded bugs, sized so that hundreds of real agent sessions stay affordable — not production codebases. The workers are mini-tier models. The savings number that matters is the modest one (~19% against a loop that already stops on success); the big one is against a baseline that barely tries. The plateau cell stalled less than we designed it to — the workers beat our “hard” modules more often than expected. And a convergence monitor inherits its verifier’s blind spots: a loop that converges on code passing the wrong tests will stop confidently on a wrong answer — we’ve measured that rate at 4.5% in earlier work and wrote about it here. Treat every number here as what it is: an internal benchmark, fully reproducible, with every trial record and the harness public — not a promise about your loop. The way to know what your loop wastes is to measure your loop.

Trying it on your own loop

If your loop already has a test command, the integration is genuinely small — the monitor just needs the failing count once per session:

from loopgain import LoopGain

lg = LoopGain(target_error=0.0, max_iterations=10)
while lg.should_continue():
    run_agent_session()                      # claude -p, codex, your harness
    failing = count_failing_tests()          # YOUR test command, not the agent's
    lg.observe(failing, output=git_sha())    # output: optional handle to roll back to

That output is optional — it’s just whatever lets you recover the best iteration later. We used a git commit per session (cheap, and a natural checkpoint for a loop that’s already rewriting a repo), but a snapshot path or even the raw output works the same way. Leave it off and you still get the stop signal, just without the roll-back handle.
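For the git-commit-per-session approach, the helpers can be as small as the sketch below. The function names and the commit message are illustrative, and best_iteration is our own bookkeeping over (sha, failing_count) pairs rather than anything LoopGain provides. The git commands themselves (add -A, commit --allow-empty, rev-parse HEAD) are standard.

```python
import subprocess


def commit_session(repo: str, message: str) -> str:
    """Commit whatever the session left in the working tree and return the SHA."""
    subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
    # --allow-empty gives a no-edit session its own SHA too.
    subprocess.run(
        ["git", "commit", "--allow-empty", "-m", message], cwd=repo, check=True
    )
    out = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def best_iteration(history):
    """history: list of (sha, failing_count). Earliest SHA with the lowest count."""
    return min(history, key=lambda item: item[1])[0]


# Hypothetical history: the second session is the earliest with the lowest count.
print(best_iteration([("a1", 14), ("b2", 2), ("c3", 2), ("d4", 5)]))
```

Rolling back is then a plain git checkout of the returned SHA. Preferring the earliest of several equally good iterations keeps the diff you have to review as small as possible.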

The library is open source (Apache-2.0), and the full harness for this experiment — trial generation, both workers, every trajectory record, and the replay scripts that produced every number above — is in the repo, so you can check our math or run the whole thing yourself.

I’d genuinely like to hear what your loops look like: how big your sessions are, what your error signal is, and especially whether you’ve seen the flat-then-breakthrough pattern in your own overnight runs. That last one is exactly the case our default stop rule still trips on — you can already raise stall_terminate_count to handle it; what we’re working out is the right default, and the more real examples we see, the better we can set it.
