We built a benchmark to find out what LoopGain
actually saves on real agent loops, then pre-registered the predictions before we
looked at the data so we couldn’t move the goalposts. It ran 2,000 paired trials —
each loop run under three fixed max_iterations caps (5, 10, 20) and under
LoopGain — across six agent frameworks: four conditions × 2,000 trials = 8,000
total loop runs.
The headline is a 92.8% reduction in API spend versus max_iter=20,
with median wall-clock dropping from 30.9s to 2.1s. That number is real, and it’s
the least interesting thing we found.
The interesting findings are the ones that complicate it. Here are five.
Surprise 1: the state we built to catch oscillation mostly catches stalls
LoopGain classifies a loop into five trajectory states — FAST_CONVERGE,
CONVERGING, STALLING, OSCILLATING, DIVERGING. We shipped all five. (A
sixth row below, TARGET_MET, is the separate short-circuit that fires when
your error signal hits its target — not one of the five trajectory states, but
it shows up in the emission counts.) Across 2,000 trials and 3,191 state
emissions, here’s how often each actually fired:
| State | Emissions | Share |
|---|---|---|
| TARGET_MET | 1,302 | 40.8% |
| FAST_CONVERGE | 836 | 26.2% |
| STALLING | 680 | 21.3% |
| DIVERGING | 364 | 11.4% |
| CONVERGING | 8 | 0.25% |
| OSCILLATING | 1 | 0.03% |
OSCILLATING and CONVERGING are slivers at this scale; STALLING is the third-most-common signal.

Here is the surprise. We originally built that third band thinking of it as the
oscillation catcher — the loop bouncing between two near-but-not-equal answers.
At scale, OSCILLATING fired once in 2,000 trials. The band that actually
does the work is STALLING, with 680 emissions. The failure mode real agent
loops produce is not a loop thrashing up and down — it is a loop getting stuck,
error pinned at a constant, grinding out iterations that change nothing.
That distinction is the whole point of the band, and it’s worth being precise
about: an oscillation is error bouncing between values; a stall is error
pinned flat. A loop stuck at “11 failing tests, 11 failing tests, 11 failing
tests” is stalling, not oscillating — and labeling it correctly is what lets the
monitor stop it. (An earlier build of the classifier mislabeled flat trajectories
as OSCILLATING; calling a stuck loop a stall is the more correct reading, and
it’s what shipped.)
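
The stall-versus-oscillation distinction can be sketched in a few lines of Python. This is a minimal illustration, not LoopGain's classifier: the function name `classify_tail`, the tolerance `tol`, the three-reading window, and the `UNDECIDED` and `OTHER` labels are all assumptions made for the example.

```python
def classify_tail(errors, tol=1e-9):
    """Label the last three error readings as a stall, an oscillation, or neither.

    Hypothetical helper for illustration only; not LoopGain's API.
    """
    if len(errors) < 3:
        return "UNDECIDED"
    a, b, c = errors[-3:]
    d1, d2 = b - a, c - b
    # Stall: error pinned flat across two consecutive steps.
    if abs(d1) <= tol and abs(d2) <= tol:
        return "STALLING"
    # Oscillation: error moves one way, then reverses direction.
    if d1 * d2 < 0:
        return "OSCILLATING"
    return "OTHER"


print(classify_tail([11, 11, 11]))  # STALLING
print(classify_tail([4, 7, 4]))     # OSCILLATING
print(classify_tail([9, 6, 2]))     # OTHER
```

The "11 failing tests" trace lands in the stall branch because neither step changes the error; only a sign flip between consecutive steps counts as bouncing.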
CONVERGING fired eight times and OSCILLATING once — the two states
that describe gradual grind-down or genuine bistable bouncing are, at this model
capability and on these workloads, almost theoretical. The reason is a fact about
2026-era models, not about our classifier: on calibrated tasks, LLMs one-shot
or they get stuck. They nail it on iteration one, or they pin at a constant
error. The smooth textbook convergence trajectory that control theory loves to
draw is something real agent loops rarely produce.
The product implication is uncomfortable and we took it: “detects all five trajectory modes” is technically true but oversells what’s validated. So we lead with what the data actually supports — catches stalls and divergence, and stops them — the states with 1,000+ real emissions between them, not the ones that fired a handful of times.
Surprise 2: the 92.8% is loaded toward the easy cases
A single headline number hides a distribution. Segment the savings by what the loop actually did:
| Loop outcome | n | Savings vs max_iter=20 |
|---|---|---|
| converged (hit target early) | 1,302 | 96.6% |
| diverged | 364 | 83.9% |
| stalled | 333 | 78.2% |
Savings vs max_iter=20, split by loop outcome. The 92.8% headline is the blend of a mostly-easy corpus; failure-mode workloads land closer to 78–84%.

LoopGain saves money in every segment, but how much depends on your workload: if your loops usually succeed fast, you’ll see numbers closer to 96.6%; if you live in adversarial, long-tail failure-mode territory, closer to 78–84%. Both are real. “92.8%” is the blend of a mostly-easy corpus, and quoting it without the spread would be the kind of benchmark sleight-of-hand we built this to avoid. (Stalled loops save the least, 78.2%, because a stall takes two consecutive flat readings to confirm before LoopGain stops, so it runs a hair longer than a hard divergence. More correct, marginally less cheap, and we’d rather show it than hide it.)
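
To get a rough feel for where your own workload might land, you can weight the three segment figures by your own outcome mix. The sketch below is a back-of-the-envelope estimate, not how the headline was computed: `estimate_savings` and both example mixes are hypothetical, and the calculation assumes every loop costs about the same when run to the cap.

```python
# Per-outcome savings vs max_iter=20 (percent), copied from the table above.
SEGMENT_SAVINGS = {"converged": 96.6, "diverged": 83.9, "stalled": 78.2}


def estimate_savings(mix):
    """Trial-weighted savings estimate (percent) for a given outcome mix.

    `mix` maps outcome name -> share of loops; shares must sum to 1.
    Hypothetical helper: assumes similar run-to-cap cost for every loop.
    """
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("outcome shares must sum to 1")
    return sum(SEGMENT_SAVINGS[name] * share for name, share in mix.items())


# Two made-up workload mixes, for illustration only.
mostly_easy = {"converged": 0.90, "diverged": 0.05, "stalled": 0.05}
failure_heavy = {"converged": 0.20, "diverged": 0.40, "stalled": 0.40}

print(round(estimate_savings(mostly_easy), 1))    # 95.0
print(round(estimate_savings(failure_heavy), 1))  # 84.2
```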
Surprise 3: on normal workloads, we mostly don’t make answers better
This is the one we most wanted to be different. On natural-distribution workloads — ordinary code-gen, debate, planning, RAG — a blind judge preferred LoopGain’s output to the run-to-cap baseline 50–63% of the time. That’s not “better.” That’s preserved: LoopGain returns an answer just as good, at roughly 5% of the cost.
LoopGain only improves quality when running longer actively hurts. On workloads engineered to degrade — where the model is pushed to keep “fixing” an already-good answer — the judge preferred LoopGain 92–95% of the time, because best-so-far rollback returns the iteration that worked instead of the one that got mangled.
This isn’t a single-trial anecdote — it’s a measurable pattern across the whole
benchmark. Of the fixed-cap (max_iter=20) runs that ran past their best
iteration, 35.3% shipped a final answer worse than the best they had already
reached — a median of 3× the error, up to 11×. The other ~65% merely plateaued,
burning tokens on an answer that never improved. A fixed cap can’t tell “done”
from “keep going,” so about a third of the time it grinds a good answer into a
worse one. Best-so-far rollback is what turns that around: LoopGain returns the
iteration that actually worked, not the mangled one the cap stopped on. That’s why
best-so-far rollback — not just early stopping — is the mechanism that protects
quality.
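
The idea behind best-so-far rollback can be shown with a toy error trace. This is a minimal sketch under stated assumptions, not LoopGain's implementation: `best_so_far` is a hypothetical helper, the trace is invented, and lower error is assumed to mean a better answer.

```python
def best_so_far(errors):
    """Return (index, error) of the lowest-error iteration in the trace.

    Ties go to the earliest iteration. Hypothetical helper, not LoopGain's API.
    """
    if not errors:
        raise ValueError("need at least one iteration")
    best_index = min(range(len(errors)), key=lambda i: errors[i])
    return best_index, errors[best_index]


# Invented trace: the loop reaches 2 failing tests, then regresses.
trace = [9, 4, 2, 3, 5, 6]
index, error = best_so_far(trace)
final = trace[-1]

print(index, error)   # 2 2
print(final / error)  # 3.0
```

A run-to-cap loop would ship the last entry of `trace`; rollback ships the entry at `index` instead, which is why the two diverge only when the loop ran past its best iteration.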
So the honest claim is two sentences, not one: LoopGain preserves quality on workloads where the model usually succeeds, at a fraction of the cost. It improves quality on workloads where iterating past success degrades the output. Anyone who tells you their loop controller makes every answer better is selling you something.
Surprise 4: we missed one of our own predictions
We pre-registered specific numeric thresholds before collecting data. We beat most of them. One we hit right at the line, and one we missed, and pre-registration is the thing that forces us to say so:
- Early-warning lead time. We predicted LoopGain would flag a diverging loop a median of ≥ 3 iterations before the fixed cap hits its catastrophic point. It flagged at a median of 3 — exactly the floor, with no margin. It did flag in 364 of 367 catastrophe trials, so it almost never misses the divergence; it just doesn’t give you more warning than we promised. Met, reported flat, not rounded up.
- Framework parity. We predicted ≤ 5 percentage-points of quality spread between adapters on the same task. On debate (AutoGen vs CrewAI) we saw 5.5 — missed by 0.5 pp. Well under the threshold that would have killed the claim, but past what we predicted, so it’s in the writeup, not a footnote.
The parity miss didn’t fire a kill criterion. We’re reporting it because a careful reader finds it anyway, and honest is faster than spin.
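
Checking a result against a pre-registered bound is mechanical, which is the point. The sketch below encodes the two predictions from the list above and evaluates them; the `PREDICTIONS` structure and the `evaluate` helper are assumptions made for illustration, not part of the benchmark harness.

```python
import operator

# (name, predicted bound, comparison, observed value); figures from the list above.
PREDICTIONS = [
    ("early-warning lead time (iterations)", 3.0, operator.ge, 3.0),
    ("framework parity spread (pp)", 5.0, operator.le, 5.5),
]


def evaluate(predictions):
    """Return {name: (met, margin)}; margin is how far inside the bound we landed."""
    results = {}
    for name, bound, compare, observed in predictions:
        met = compare(observed, bound)
        # Positive margin means room to spare; negative means the bound was missed.
        margin = observed - bound if compare is operator.ge else bound - observed
        results[name] = (met, margin)
    return results


for name, (met, margin) in evaluate(PREDICTIONS).items():
    print(f"{name}: {'met' if met else 'missed'} (margin {margin:+.1f})")
```

With the bounds fixed in advance, a zero margin reads as "met, no headroom" and a negative margin reads as "missed", with no room to reinterpret either after the fact.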
Surprise 5: aggressive stopping has a real cost on retrieval
On the iterative-RAG cell, LoopGain cut cost 98.1% — the biggest single-cell saving in the bench. It also retrieved the gold document 4.5 percentage points less often than the run-to-cap baseline (85.0% vs 89.5% hit@5). LoopGain’s early stop occasionally cuts off a retrieval loop that would have found the answer with a few more revision attempts.
That’s a genuine trade-off, not a rounding error: on retrieval workloads you can buy back some of that 4.5 pp by making LoopGain’s stopping more conservative. We report the default-config result because shipping the number that flatters you and hiding the tuning knob is how benchmarks lose trust.
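
For readers unfamiliar with the metric, hit@5 asks whether the gold document appears anywhere in the top five retrieved results. A minimal sketch follows; the helper names and the sample runs are hypothetical and are not taken from the benchmark data.

```python
def hit_at_k(ranked_ids, gold_id, k=5):
    """True if the gold document appears in the top-k retrieved ids."""
    return gold_id in ranked_ids[:k]


def hit_rate(runs, k=5):
    """Fraction of (ranked_ids, gold_id) runs that hit within the top k."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(hit_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)


# Invented retrieval results, for illustration only.
runs = [
    (["d3", "d7", "d1"], "d7"),                    # gold at rank 2: hit
    (["d2", "d9", "d4", "d8", "d5", "d6"], "d6"),  # gold at rank 6: miss at k=5
]

print(hit_rate(runs))  # 0.5
```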
Why pre-register a benchmark for your own product?
Because a benchmark’s job is to produce the truest number, not the cleanest one, and the only way to keep yourself honest is to write the predictions down before you can see the data. We locked the protocol, the hypotheses, and six kill criteria before any real cell ran. Zero of the six fired, but one prediction was missed and another landed right at the line, and you just read about both, because the registration made hiding them impossible.
Everything is public: the protocol with its amendments, the full raw trial data (every iteration of all 8,000 runs), the per-cell tables, and the live dashboard populated with all 2,000 trials. Reproduce it, argue with it, or run it on your own workloads — the harness is in the repo.
The headline is 92.8% cheaper. The reason to believe it is everything above that complicates it.
LoopGain is pip install loopgain, Apache-2.0. The benchmark, protocol, and raw
data are open; the numbers here
are the registered results from loopgain 0.4.0.