Boris Cherny, the creator of Claude Code, was asked about his setup at a talk last week. As someone focused entirely on loops, I noticed his answer:
"I sort of feel like loops are the future at this point."
He wasn’t being abstract — he described running dozens of them, cron jobs that
each spawn an agent session, check the result, and go again: a few hundred
agents during the day, a few thousand overnight. The pattern has a name now,
loop engineering, and the simplest version is the Ralph loop:
while tests fail, run the agent again.
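As a minimal sketch, that one-line description maps onto a few lines of Python. The names `ralph_loop`, `run_agent`, and `count_failing` are hypothetical stand-ins rather than anyone's real API, and the session cap is an assumption added so the sketch always terminates.

```python
from typing import Callable


def ralph_loop(run_agent: Callable[[], None],
               count_failing: Callable[[], int],
               max_sessions: int = 10) -> int:
    """Run agent sessions until the tests pass or the cap is hit.

    Returns the number of sessions that were run.
    """
    sessions = 0
    while count_failing() > 0 and sessions < max_sessions:
        run_agent()          # one full agent session
        sessions += 1
    return sessions


# Toy stand-in for a real agent: each "session" fixes one failing test.
state = {"failing": 3}

def fake_agent() -> None:
    state["failing"] -= 1

print(ralph_loop(fake_agent, lambda: state["failing"]))  # prints 3
```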
What’s interesting is that these loops are verify-revise loops. That’s the exact structure LoopGain already instruments inside agent workflows, except the iterations are a hundred to a thousand times more expensive: one iteration is an entire agent session (an outer loop), with all its tool calls and file edits, instead of one model call (an inner loop). Addy Osmani’s “Loop Engineering” essay even flags it in passing — you “absolutely have to be careful about token costs” — but as an aside, not a headline, and as far as I can tell nobody in the conversation is actually measuring it.
So we measured it. We ran 90 real fix-until-green agent loops — every iteration a real headless agent session spending real API dollars — with LoopGain watching the failing-test count from the outside. Two things came out of it: the convergence math holds cleanly at session scale, and our own stop rule has an honest bug we’re now fixing. We also caught ourselves nearly publishing a third “finding” that turned out to be an artifact of how we ran the test — that one’s near the end, because the mistake is the most useful part.
TL;DR
- The convergence math is coherent at session scale. Across all 90 loops, the classifier produced zero incoherent readings — it never once labeled an improving loop as diverging or oscillating, even though each session does completely different work in different files. This is the core claim, and it held again across a separate 36-loop cross-model run.
- When sessions have budget, the governed stop is clean. In the well-budgeted cell, 30 of 30 loops converged and stopped at exactly the first zero-error session. No false stops.
- We found a real bug in our own stop rule. On a deliberately hard, budget-tight cell, the loop false-stopped 13 of 30 times — and 9 of those would have reached zero if we’d let them run. It’s a hardcoded rule in our core, and the fix is shipping.
- The savings are real but smaller than the hype. Against a bare Ralph loop with no completion check, a governed stop cuts ~78% of spend. Against a smarter loop that already stops on success, the honest number is ~19% — the money spent grinding on loops that were never going to converge. Don’t let anyone, us included, quote the big number without the baseline.
- None of this is Claude-specific. A matched run with a GPT worker reproduced the coherence and the same near-breakthrough false stops. The differences between models were capability, not budget.
What we actually ran
Each trial is a small scratch repo with seeded bugs and a pytest suite that acts as the spec — 27 to 51 tests, with anywhere from 7 to 21 failing at the start. The loop works the way a real overnight loop does:
- Spawn a real headless agent session — Haiku 4.5 behind a minimal driver that makes one model call per turn — with a bounded turn budget and one instruction: run the tests, fix the code, never touch the tests.
- When the session ends, the harness — not the worker — restores the test files from git and runs pytest itself. The failing count is the error signal. The worker never grades its own work.
- LoopGain.observe(failing_count) gets the number, and each iteration is committed to git, so “roll back to the best iteration” means an actual SHA you can check out.
- Repeat for up to 10 sessions (the outer-loop cap an uninstrumented loop would hit) so we can compare what a governed stop would have done against what the full run actually cost.
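The grading step in that list can be sketched roughly as follows. This is not the harness from the experiment: `score_session`, `parse_failing_count`, the `tests` directory name, and the 300-second timeout are all assumptions made for illustration. It relies only on standard `git` and `pytest` command-line behavior.

```python
import re
import subprocess
import sys


def parse_failing_count(pytest_output: str) -> int:
    """Sum the 'failed' and 'error' counts on pytest's final summary line."""
    lines = [ln for ln in pytest_output.splitlines() if ln.strip()]
    summary = lines[-1] if lines else ""
    counts = re.findall(r"(\d+) (?:failed|errors?)\b", summary)
    return sum(int(n) for n in counts)


def score_session(repo: str, base_ref: str, tests_dir: str = "tests") -> tuple[int, str]:
    """Grade one finished session from outside the worker.

    Returns (failing_count, commit_sha) for the state the session left behind.
    """
    def git(*args: str) -> str:
        done = subprocess.run(["git", *args], cwd=repo, check=True,
                              capture_output=True, text=True)
        return done.stdout.strip()

    # Restore tracked test files so the worker cannot grade its own edits.
    git("checkout", base_ref, "--", tests_dir)
    result = subprocess.run([sys.executable, "-m", "pytest", "-q", tests_dir],
                            cwd=repo, capture_output=True, text=True,
                            timeout=300)  # assumed limit, tune for your suite
    failing = parse_failing_count(result.stdout)
    # Commit every session so each reading has a SHA to roll back to.
    git("add", "-A")
    git("commit", "--allow-empty", "-m", f"session: {failing} failing")
    return failing, git("rev-parse", "HEAD")
```

The point of the design is that the number comes from a process the worker never controls: the tests are restored first, and the harness runs them itself.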
One deliberate choice up front: the worker runs on the raw API, one inference per turn, not inside a production agent CLI. A benchmark has to hold the driver constant at the simplest, most transparent layer, or “a turn” stops meaning the same thing from one run to the next. (That choice turned out to matter a lot — see the near-miss at the end.) The whole experiment — 90 loops, 271 agent sessions — cost $6.50. The nice thing about measuring waste with cheap workers is that the percentages carry; the absolute dollars stay small until your loops don’t.
We ran three kinds of loops on purpose: an easy cell where sessions had plenty of budget (10 turns), a cell with genuinely hard tail bugs designed to make loops plateau (8 turns), and a deliberately budget-tight cell (6 turns) where a session often can’t finish a fix before its budget runs out.
Where it works: the math doesn’t care that sessions are messy
My honest worry going in was that session-scale iterations would be too
heterogeneous for the convergence math. Inside a workflow, iteration N and
iteration N+1 are revisions of the same answer. Out here, session 3 might fix
two bugs in dates.py while session 4 rewrites a CSV parser. Would the error
series even mean anything?
It did, and more cleanly than I expected. Across all 90 loops there wasn’t a
single coherence violation: no monotonically improving loop ever got labeled
DIVERGING or OSCILLATING. A typical good trajectory looked like 14 failing
tests, then 2, 1, 0 — a continue verdict at every reading on the way down, and
a stop at exactly the first zero-error session. Because every iteration is a
real git commit, “stop and keep the best one” returns an actual SHA, not
whatever state the last session happened to leave behind.
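To make "coherence violation" concrete, here is one way the check could be written. It is a sketch under my own reading of the claim: a series counts as monotonically improving if it never goes up and ends lower than it started. The label `IMPROVING` in the example is a placeholder, not necessarily a label LoopGain emits; only DIVERGING and OSCILLATING come from the text above.

```python
from typing import List, Sequence

# Labels that would contradict a series that only ever goes down.
CONTRADICTING_LABELS = {"DIVERGING", "OSCILLATING"}


def is_monotone_improving(errors: Sequence[float]) -> bool:
    """True if the series never increases and ends below where it started."""
    if len(errors) < 2:
        return False
    never_rises = all(b <= a for a, b in zip(errors, errors[1:]))
    return never_rises and errors[-1] < errors[0]


def coherence_violations(errors: Sequence[float],
                         labels: Sequence[str]) -> List[int]:
    """Return indices of readings whose label contradicts an improving series."""
    if not is_monotone_improving(errors):
        return []
    return [i for i, label in enumerate(labels) if label in CONTRADICTING_LABELS]


# The typical good trajectory from the text, with placeholder labels.
print(coherence_violations([14, 2, 1, 0], ["IMPROVING"] * 4))  # prints []
```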
In the well-budgeted cell, 30 of 30 loops converged and the governed stop fired at the right moment every time — zero false stops. If your loop’s sessions are big enough to make visible progress each time, the governed stop is close to free money. The interesting part is what happens when they aren’t.
Where it’s still wrong: the stop rule cuts near-breakthroughs
The budget-tight cell is the one I’d want every loop engineer to see, because it’s the cell most overnight loops actually live in.
When a session is too small to finish a fix, the work becomes bursty. A
session reads the failing tests, edits half a fix, and runs out of budget. The
next session finishes it, and the count finally moves. One real loop from this
cell went 20, 20, 20, 20, 7, 0 — stuck for four sessions, then solved over
the next two. LoopGain’s shipped stop rule killed it after the second flat
reading — three sessions before the fix landed.
That’s the bug, and it’s ours. In the budget-tight cell the loop false-stopped
13 of 30 times, and 9 of those killed a run that would have reached zero
failing tests. When we went looking for the cause we found it in our own core:
the loop terminates after two consecutive STALLING readings, and that
consecutive count is hardcoded. We do ship a configurable knob
(stall_patience), but it governs when a reading first turns STALLING — not
how many consecutive stalls end the loop. The count that actually pulls the
trigger had no knob at all.
Here’s the part I find genuinely interesting, because it’s a property of
session scale itself, not of our code. The session-boundary error signal is a
coarse, decimated sample of the real fix trajectory. Each session already
squeezes out everything it can before it stops, so you only get to read the
error at session boundaries — and on a hard remainder, two sessions in a row
can both fail to crack the same bug and report the identical count. Flat. In
this cell, 54% of sessions left the failing count unchanged. At inner-loop
scale — one model call per iteration — the signal has fine resolution and moves
almost every step, so a no-movement reading is rare and therefore
informative. At session scale, no-movement is the expected texture, not a
danger sign. Same monitor, opposite statistics: the inner loop’s real risk is a
bad generation spiking the error up (OSCILLATING), while the outer loop’s is
a coarse flat stretch reading as a stall.
The good news is the fix is cheap and we measured it. We replayed every loop under stall thresholds from 2 to 5 consecutive readings. Raising the count from 2 to 5 cut false stops across all 90 loops from 14 to 7 while catching every genuine non-converging grind — the true catches stayed flat at 5 across the whole sweep. At session scale, a single flat reading is just weak evidence; you have to see several in a row before “stuck” is the right call.
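The replay idea is simple enough to sketch. This is a simplified model, not LoopGain's classifier: it assumes a reading counts as a stall whenever it fails to improve on the previous one, and it ignores the separate stall-onset setting entirely. `replay_stop` is a hypothetical name. The trajectory is the 20, 20, 20, 20, 7, 0 loop described earlier.

```python
from typing import Sequence, Tuple


def replay_stop(errors: Sequence[int],
                stall_terminate_count: int,
                target: int = 0) -> Tuple[int, str]:
    """Replay a recorded failing-count series under a consecutive-stall rule.

    Returns (sessions_run, reason), where reason is 'converged', 'stalled',
    or 'cap' if the series ran out without either.
    """
    stalls = 0
    for i, err in enumerate(errors):
        if err <= target:
            return i + 1, "converged"
        # Simplifying assumption: no improvement on the previous reading = stall.
        if i > 0 and err >= errors[i - 1]:
            stalls += 1
        else:
            stalls = 0
        if stalls >= stall_terminate_count:
            return i + 1, "stalled"
    return len(errors), "cap"


trajectory = [20, 20, 20, 20, 7, 0]
for count in range(2, 6):
    print(count, replay_stop(trajectory, count))
# 2 (3, 'stalled')
# 3 (4, 'stalled')
# 4 (6, 'converged')
# 5 (6, 'converged')
```

Under this model a count of 2 stops after session 3, three sessions before the fix lands, which matches the behavior described above. A count of 4 or more lets this particular loop finish.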
What an ungoverned loop actually wastes
So what does the governed stop save? It depends entirely on what you compare it to, and the honest answer is two numbers, not one:
| Stop rule | vs a bare cap-10 loop (no completion check) | vs an until-green loop (stops on success) |
|---|---|---|
| LoopGain, shipped behavior | saves 78.3% | saves 19.4% |
| LoopGain, patient variant (count 5) | saves 75.4% | saves 8.5% |
The first column is the real Ralph loop a lot of people actually run —
while :; do agent; done, no success check, grinding to a fixed cap. Against
that, a governed stop saves about four-fifths, because most loops converge in a
few sessions and the rest is pure waste. But that’s a low bar: any early stop
clears it. The honest column is the second one — a loop that already quits on
success. There, LoopGain’s marginal value is just the spend on loops that were
never going to converge, and on this bench that’s ~19%. Real, worth having,
but not the headline number you’ll see quoted. The patient variant saves even
less against the smart baseline — which makes sense, it’s trading savings for
killing fewer of those recoverable runs. That trade is the whole design
question, and it’s why the stall count is becoming a knob.
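Here is a rough sketch of why the same stop rule produces two such different percentages. The four loop records are made up purely for illustration and are not data from the experiment. `LoopRecord` and `savings` are hypothetical names, and the sketch assumes every session costs the same.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class LoopRecord:
    first_green: Optional[int]  # 1-based session where failing first hit 0; None if never
    governed_stop: int          # 1-based session where the governed rule stopped


def savings(records: Sequence[LoopRecord], cap: int = 10) -> Tuple[float, float]:
    """Return (saving vs bare cap loop, saving vs until-green loop) as fractions."""
    # Bare loop: no success check, so every loop burns the full cap.
    bare = cap * len(records)
    # Until-green loop: stops on success, grinds to the cap otherwise.
    until_green = sum(r.first_green if r.first_green is not None else cap
                      for r in records)
    governed = sum(r.governed_stop for r in records)
    return 1 - governed / bare, 1 - governed / until_green


# Hypothetical records: three loops converge, one never does.
records = [LoopRecord(3, 3), LoopRecord(4, 4), LoopRecord(2, 2), LoopRecord(None, 3)]
vs_bare, vs_green = savings(records)
print(f"{vs_bare:.1%} vs bare cap, {vs_green:.1%} vs until-green")
# prints: 70.0% vs bare cap, 36.8% vs until-green
```

Against the bare loop, the saving is dominated by sessions run after the tests were already green. Against the until-green loop, the only saving left is the one loop that never converged. The sketch counts spend only; it says nothing about fixes lost to a false stop.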
One more thing about those percentages, because it’s where session scale bites. Per iteration, the outer loop is the expensive one: a single agent session (a whole run of model calls and tool use) costs a hundred to a thousand times what one inner-loop model call does. Inner loops fire far more often, so their savings add up and are just as real; the wrinkle out here is that one wrong stop throws away a whole expensive session at once. Ours were cheap, $0.05–$0.10 on mini-tier models and tiny repos, but a production overnight loop on a frontier model and a real codebase runs $0.50–$5 a session, hundreds a night. Multiply the percentages above by your own session cost and loop count, and call the result an extrapolation, because that’s what it is.
How we almost fooled ourselves
Here’s the near-miss I promised, because it changed how we’ll run every benchmark after this.
We first ran this study on the Claude Code CLI (claude -p) as the worker,
and the budget-tight cell looked dramatic: 24 of 30 false stops, almost the
entire cell read as a starved grind the monitor kept cutting. It would have made
a punchier post. It was also wrong.
A claude -p “turn” is not one model call — the CLI spends turns orienting,
planning tool use, managing its own context. So “6 turns” under the CLI is a
very different amount of work than 6 raw-API inferences, and at that budget the
CLI was starving the worker: 262 of 267 sessions hit the turn cap without
finishing, which manufactured exactly the long flat trajectories that trip the
stall rule. When we re-ran the identical cells, seeds, and budgets on the raw
API — one inference per turn — the same cell mostly converged, and false
stops dropped from 24/30 to the 13/30 you read above. The 24/30 measured
claude -p’s turn semantics, not anything about loops.
The lesson cost us a finding but it’s worth more than the finding was: a benchmark has to hold the driver constant at the simplest layer. How your specific production harness behaves is a real and useful question — it’s just a different question, and mixing the two is how you end up publishing your tool chain’s quirks as laws of nature.
It’s not a Claude thing
LoopGain is deliberately model-agnostic, so we ran a matched cross-model comparison: Haiku 4.5 and gpt-5-mini through the same minimal raw-API driver, same two tools, one inference per turn, identical budgets, same repos and bugs. 36 loops, $1.58.
The monitor findings replicated: zero coherence violations in all 36 loops, a clean 12-for-12 in the well-budgeted cells, genuine stalls correctly caught on both models, and the same near-breakthrough false stops as the main run — one GPT loop sat flat at 16 failing tests for several sessions, then cleared them, after the shipped rule had already killed it. The early-kill problem is a property of the stop rule, not of any model.
What differs between the models is capability, not budget. At identical budgets, Haiku converged 14 of 18 loops to gpt-5-mini’s 10, and on the hard plateau modules Haiku kept attempting edits where GPT more often stopped trying. Budget behavior, though, was near-identical: at six turns both models reached a file edit on every fresh session, and the flat, no-edit sessions only showed up late, after a loop had plateaued — at almost the same rate for both (69% each). So if you move a loop between models, expect capability differences; don’t expect to need different turn budgets. And the prices: the matched cells cost $1.26 (Haiku) vs $0.32 (gpt-5-mini) for identical work — worker prices vary by multiples at the same nominal tier, which is its own argument for measuring instead of guessing.
What we’re changing because of this
This experiment was the validation gate for supporting outer loops at all, and it both passed and embarrassed us in the same week, which I think is what a good benchmark is supposed to do.
- The consecutive-stall count is now a real, configurable parameter (stall_terminate_count, shipped in v0.6.0). It was hardcoded; before, only the stall onset was tunable. The default stays 2, which is right for inner, per-call loops, but session-scale loops can now raise it. The data points to ~5, and we’re running a larger study to pin the session-scale default with a confidence interval before we change the default itself.
- The integration docs get the budget rule, stated driver-relative. Turn budgets only mean something for a specific harness, so measure your own floor instead of copying anyone’s number, ours included.
- Benchmarks run on the raw API, one inference per turn. Production-harness behavior gets its own, separately labeled tests. (See above for why we learned this the hard way.)
- One of our trial workers wrote an infinite loop into the code under test and hung the evaluation — our harness had a timeout, and yours should too. If your error signal can hang, your loop can hang.
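On that last point, a timeout around the evaluation command takes only a few lines with the standard library. This is a generic sketch rather than the harness's actual code. `run_with_timeout` is a hypothetical helper, and treating a timeout as a failed evaluation is one reasonable policy among several.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_with_timeout(cmd: List[str], timeout_s: float) -> Tuple[Optional[int], str]:
    """Run a command and return (returncode, stdout).

    Returns (None, "") if the command exceeds timeout_s; subprocess.run
    kills the child process before raising TimeoutExpired.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None, ""
    return done.returncode, done.stdout


# Usage: a hung test run comes back as None instead of blocking the loop.
# code, out = run_with_timeout([sys.executable, "-m", "pytest", "-q"], 300)
# if code is None:
#     ...  # record the session as a failed evaluation and move on
```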
The honest limits
These are scratch repos with seeded bugs, sized so that hundreds of real agent sessions stay affordable — not production codebases. The workers are mini-tier models. The savings number that matters is the modest one (~19% against a loop that already stops on success); the big one is against a baseline that barely tries. The plateau cell stalled less than we designed it to — the workers beat our “hard” modules more often than expected. And a convergence monitor inherits its verifier’s blind spots: a loop that converges on code passing the wrong tests will stop confidently on a wrong answer — we’ve measured that rate at 4.5% in earlier work and wrote about it here. Treat every number here as what it is: an internal benchmark, fully reproducible, with every trial record and the harness public — not a promise about your loop. The way to know what your loop wastes is to measure your loop.
Trying it on your own loop
If your loop already has a test command, the integration is genuinely small — the monitor just needs the failing count once per session:
```python
from loopgain import LoopGain

lg = LoopGain(target_error=0.0, max_iterations=10)
while lg.should_continue():
    run_agent_session()                    # claude -p, codex, your harness
    failing = count_failing_tests()        # YOUR test command, not the agent's
    lg.observe(failing, output=git_sha())  # output: optional handle to roll back to
```
That output is optional — it’s just whatever lets you recover the best
iteration later. We used a git commit per session (cheap, and a natural
checkpoint for a loop that’s already rewriting a repo), but a snapshot path or
even the raw output works the same way. Leave it off and you still get the stop
signal, just without the roll-back handle.
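If you track the handles yourself, picking the one to roll back to is a one-liner. This sketch uses no LoopGain API; `best_iteration` is a hypothetical helper and the short SHAs are made-up placeholders. Ties go to the earliest session, on the assumption that the first commit to reach the best count is the one you want.

```python
from typing import List, Tuple


def best_iteration(history: List[Tuple[int, str]]) -> Tuple[int, str]:
    """Return the earliest (failing_count, handle) with the lowest failing count.

    history is in session order; min() keeps the first of any tied entries.
    """
    return min(history, key=lambda item: item[0])


# Placeholder handles standing in for real commit SHAs.
history = [(14, "a1b2c3"), (2, "d4e5f6"), (3, "0718ab"), (2, "9c8d7e")]
failing, sha = best_iteration(history)
print(failing, sha)  # prints: 2 d4e5f6
```

With a git commit per session, recovering that state is then an ordinary `git checkout` of the returned SHA.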
The library is open source (Apache-2.0), and the full harness for this experiment — trial generation, both workers, every trajectory record, and the replay scripts that produced every number above — is in the repo, so you can check our math or run the whole thing yourself.
I’d genuinely like to hear what your loops look like: how big your sessions are,
what your error signal is, and especially whether you’ve seen the
flat-then-breakthrough pattern in your own overnight runs. That last one is
exactly the case our default stop rule still trips on — you can already raise
stall_terminate_count to handle it; what we’re working out is the right
default, and the more real examples we see, the better we can set it.