We ran 2,000 paired agent-loop trials. Here's what surprised us.

We built a benchmark to find out what LoopGain actually saves on real agent loops, then pre-registered the predictions before we looked at the data so we couldn’t move the goalposts. It ran 2,000 paired trials — each loop run under three fixed max_iterations caps (5, 10, 20) and under LoopGain — across six agent frameworks: four conditions × 2,000 trials = 8,000 total loop runs.

The headline is a 92.8% reduction in API spend versus max_iter=20, with median wall-clock dropping from 30.9s to 2.1s. That number is real, and it’s the least interesting thing we found.

Total API spend across all 2,000 trials, by stopping condition. The fixed caps climb with their iteration budget; LoopGain stops when the loop is done.

The interesting findings are the ones that complicate it. Here are five.

Surprise 1: the state we built to catch oscillation mostly catches stalls

LoopGain classifies a loop into five trajectory states — FAST_CONVERGE, CONVERGING, STALLING, OSCILLATING, DIVERGING. We shipped all five. (A sixth row below, TARGET_MET, is the separate short-circuit that fires when your error signal hits its target — not one of the five trajectory states, but it shows up in the emission counts.) Across 2,000 trials and 3,191 state emissions, here’s how often each actually fired:

| State | Emissions | Share |
| --- | --- | --- |
| `TARGET_MET` | 1,302 | 40.8% |
| `FAST_CONVERGE` | 836 | 26.2% |
| `STALLING` | 680 | 21.3% |
| `DIVERGING` | 364 | 11.4% |
| `CONVERGING` | 8 | 0.25% |
| `OSCILLATING` | 1 | 0.03% |

State emission counts across 2,000 trials. OSCILLATING and CONVERGING are slivers at this scale; STALLING is the third-most-common signal.
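The shares in that table are each state's count divided by the 3,191 total emissions. A minimal sketch that recomputes them from the published counts (the `emissions` dictionary and `share` helper are names chosen here for illustration, not part of LoopGain):

```python
# Emission counts copied from the table above.
emissions = {
    "TARGET_MET": 1302,
    "FAST_CONVERGE": 836,
    "STALLING": 680,
    "DIVERGING": 364,
    "CONVERGING": 8,
    "OSCILLATING": 1,
}

total = sum(emissions.values())  # all state emissions across the 2,000 trials


def share(state: str) -> float:
    """Percentage of all emissions attributed to one state."""
    return 100.0 * emissions[state] / total


for state, count in emissions.items():
    print(f"{state:<14}{count:>6}{share(state):>8.2f}%")
```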

Here is the surprise. We originally built that third band, the one that now reports STALLING, as the oscillation catcher: the loop bouncing between two near-but-not-equal answers. At scale, OSCILLATING fired once in 2,000 trials. The band that actually does the work is STALLING, with 680 emissions. The failure mode real agent loops produce is not a loop thrashing up and down. It is a loop getting stuck, error pinned at a constant, grinding out iterations that change nothing.

That distinction is the whole point of the band, and it’s worth being precise about: an oscillation is error bouncing between values; a stall is error pinned flat. A loop stuck at “11 failing tests, 11 failing tests, 11 failing tests” is stalling, not oscillating — and labeling it correctly is what lets the monitor stop it. (An earlier build of the classifier mislabeled flat trajectories as OSCILLATING; calling a stuck loop a stall is the more correct reading, and it’s what shipped.)
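The stall-versus-oscillation distinction can be made concrete with a toy rule over the last three error readings. This is a hedged sketch, not LoopGain's classifier: the function name `classify_tail`, the tolerance `tol`, the three-reading window, and the `IMPROVING` and `UNDETERMINED` labels are all assumptions made for illustration.

```python
from typing import Sequence


def _sign(delta: float, tol: float) -> int:
    """Return +1, -1, or 0 for a change that is up, down, or flat within tol."""
    if delta > tol:
        return 1
    if delta < -tol:
        return -1
    return 0


def classify_tail(errors: Sequence[float], tol: float = 1e-9) -> str:
    """Label the last three error readings of a loop (hypothetical rules).

    STALLING    : both steps are flat (error pinned at a constant)
    OSCILLATING : the direction of change flips between the two steps
    DIVERGING   : error rose over the window and never fell
    IMPROVING   : error fell over the window and never rose
    """
    if len(errors) < 3:
        return "UNDETERMINED"
    a, b, c = errors[-3:]
    s1, s2 = _sign(b - a, tol), _sign(c - b, tol)
    if s1 == 0 and s2 == 0:
        return "STALLING"
    if s1 * s2 < 0:
        return "OSCILLATING"
    if s1 + s2 > 0:
        return "DIVERGING"
    return "IMPROVING"


# "11 failing tests, 11 failing tests, 11 failing tests" is a stall.
print(classify_tail([11, 11, 11]))
# Bouncing between two values is an oscillation.
print(classify_tail([4, 7, 4]))
```

Under these rules a flat trajectory can never be reported as an oscillation, because a sign flip requires two non-flat steps in opposite directions.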

CONVERGING fired eight times and OSCILLATING once — the two states that describe gradual grind-down or genuine bistable bouncing are, at this model capability and on these workloads, almost theoretical. The reason is a fact about 2026-era models, not about our classifier: on calibrated tasks, LLMs one-shot or they get stuck. They nail it on iteration one, or they pin at a constant error. The smooth textbook convergence trajectory that control theory loves to draw is something real agent loops rarely produce.

The product implication is uncomfortable and we took it: “detects all five trajectory modes” is technically true but oversells what’s validated. So we lead with what the data actually supports — catches stalls and divergence, and stops them — the states with 1,000+ real emissions between them, not the ones that fired a handful of times.

Surprise 2: the 92.8% is loaded toward the easy cases

A single headline number hides a distribution. Segment the savings by what the loop actually did:

| Loop outcome | n | Savings vs `max_iter=20` |
| --- | --- | --- |
| converged (hit target early) | 1,302 | 96.6% |
| diverged | 364 | 83.9% |
| stalled | 333 | 78.2% |

Cost savings vs max_iter=20, split by loop outcome. The 92.8% headline is the blend of a mostly-easy corpus; failure-mode workloads land closer to 78–84%.

LoopGain saves money in every segment — but if your loops usually succeed fast, you’ll see numbers closer to 96.6%; if you live in adversarial, long-tail failure-mode territory, closer to 78–84%. Both are real. “92.8%” is the blend of a mostly-easy corpus, and quoting it without the spread would be the kind of benchmark sleight-of-hand we built this to avoid. (Stalled loops save the least — 78.2% — because a stall takes two consecutive flat readings to confirm before LoopGain stops, so it runs a hair longer than a hard divergence. More correct, marginally less cheap, and we’d rather show it than hide it.)
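To estimate where a particular workload might land, the segment figures can be blended by that workload's own outcome mix. The sketch below is a rough approximation under a stated assumption: it weights by trial share, whereas the headline is computed on API spend, and the per-segment baseline spend is not given in this post. For that reason the trial-count blend of the benchmark mix comes out near 91.2%, not 92.8%. The `failure_heavy_mix` weights are made up for illustration.

```python
# Per-segment figures copied from the table above: (trial count, savings %).
segments = {
    "converged": (1302, 96.6),
    "diverged": (364, 83.9),
    "stalled": (333, 78.2),
}


def blended_savings(mix: dict) -> float:
    """Weighted average of segment savings for a given outcome mix.

    `mix` maps segment name to a non-negative weight. Weights are
    normalised here, so trial counts and fractions both work.
    """
    total = sum(mix.values())
    if total <= 0:
        raise ValueError("mix must contain at least one positive weight")
    return sum(weight * segments[name][1] for name, weight in mix.items()) / total


# The benchmark's own mix, weighted by trial count (an approximation).
benchmark_mix = {name: n for name, (n, _) in segments.items()}
# A hypothetical workload dominated by failure modes.
failure_heavy_mix = {"converged": 0.2, "diverged": 0.4, "stalled": 0.4}

print(f"benchmark mix (by trial count): {blended_savings(benchmark_mix):.1f}%")
print(f"failure-heavy mix:              {blended_savings(failure_heavy_mix):.1f}%")
```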

Surprise 3: on normal workloads, we mostly don’t make answers better

This is the one we most wanted to be different. On natural-distribution workloads — ordinary code-gen, debate, planning, RAG — a blind judge preferred LoopGain’s output to the run-to-cap baseline 50–63% of the time. That’s not “better.” That’s preserved: LoopGain returns an answer just as good, at roughly 5% of the cost.

LoopGain only improves quality when running longer actively hurts. On workloads engineered to degrade — where the model is pushed to keep “fixing” an already-good answer — the judge preferred LoopGain 92–95% of the time, because best-so-far rollback returns the iteration that worked instead of the one that got mangled.

This isn’t a single-trial anecdote — it’s a measurable pattern across the whole benchmark. Of the fixed-cap (max_iter=20) runs that ran past their best iteration, 35.3% shipped a final answer worse than the best they had already reached — a median of 3× the error, up to 11×. The other ~65% merely plateaued, burning tokens on an answer that never improved. A fixed cap can’t tell “done” from “keep going,” so about a third of the time it grinds a good answer into a worse one. Best-so-far rollback is what turns that around: LoopGain returns the iteration that actually worked, not the mangled one the cap stopped on. That’s why best-so-far rollback — not just early stopping — is the mechanism that protects quality.
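Best-so-far rollback is simple to sketch: remember the lowest-error iteration and return it instead of the last one. The code below is an illustration under assumptions, not LoopGain's implementation. The `run_with_rollback` function, the `step` callback signature, and the toy error trajectory are all hypothetical.

```python
from typing import Callable, Optional, Tuple


def run_with_rollback(
    step: Callable[[int], Tuple[str, float]],
    max_iter: int,
    target: float = 0.0,
) -> Tuple[str, float, int]:
    """Run `step` for up to `max_iter` iterations and return the best result seen.

    `step(i)` is a hypothetical callback returning (output, error) for
    iteration i. The return value is (output, error, iteration) for the
    lowest-error iteration, not for the last one.
    """
    if max_iter < 1:
        raise ValueError("max_iter must be at least 1")
    best: Optional[Tuple[str, float, int]] = None
    for i in range(1, max_iter + 1):
        output, error = step(i)
        if best is None or error < best[1]:
            best = (output, error, i)  # strictly better, so remember it
        if error <= target:
            break  # target met, nothing left to improve
    assert best is not None
    return best


# Made-up error trajectory: improves for three iterations, then degrades.
toy_errors = [5.0, 2.0, 1.0, 3.0, 6.0, 11.0]


def toy_step(i: int) -> Tuple[str, float]:
    return f"draft-{i}", toy_errors[i - 1]


best_output, best_error, best_iter = run_with_rollback(toy_step, max_iter=len(toy_errors))
final_output, final_error = toy_step(len(toy_errors))
print(f"rollback returns {best_output} (error {best_error}) from iteration {best_iter}")
print(f"a fixed cap would ship {final_output} (error {final_error})")
```

Note that this sketch only covers the rollback half. It still runs to the cap unless the target is met, so the cost savings come from pairing it with an early-stop rule.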

So the honest claim is two sentences, not one: LoopGain preserves quality on workloads where the model usually succeeds, at a fraction of the cost. It improves quality on workloads where iterating past success degrades the output. Anyone who tells you their loop controller makes every answer better is selling you something.

Surprise 4: we missed one of our own predictions

We pre-registered specific numeric floors before collecting data. We beat most of them. One we hit right at the line, and one we missed — and pre-registration is the thing that forces us to say so:

Early-warning lead time. We predicted LoopGain would flag a diverging loop a median of ≥ 3 iterations before the fixed cap hits its catastrophic point. It flagged at a median of 3 — exactly the floor, with no margin. It did flag in 364 of 367 catastrophe trials, so it almost never misses the divergence; it just doesn’t give you more warning than we promised. Met, reported flat, not rounded up.
Framework parity. We predicted ≤ 5 percentage points of quality spread between adapters on the same task. On debate (AutoGen vs CrewAI) we saw 5.5, a miss of 0.5 pp. That is well under the threshold that would have killed the claim, but past what we predicted, so it’s in the writeup, not a footnote.

The parity miss didn’t fire a kill criterion. We’re reporting it because a careful reader finds it anyway, and honest is faster than spin.
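The two results above follow mechanically from comparing each observed value with its registered threshold. A small sketch of that check, using only the numbers quoted in this section (the `Prediction` class and its verdict wording are assumptions for illustration, and the kill-criterion thresholds are not modeled because they are not stated here):

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    name: str
    comparator: str  # ">=" for a floor, "<=" for a ceiling
    threshold: float
    observed: float

    def verdict(self) -> str:
        """Report a result flat: missed, met exactly, or met with margin."""
        ops = {">=": operator.ge, "<=": operator.le}
        if not ops[self.comparator](self.observed, self.threshold):
            return "missed"
        if self.observed == self.threshold:
            return "met at the line"
        return "met with margin"


predictions = [
    # Median iterations of warning before the fixed cap's catastrophic point.
    Prediction("early-warning lead time", ">=", 3, 3),
    # Quality spread between adapters on the same task, in percentage points.
    Prediction("framework parity (debate)", "<=", 5.0, 5.5),
]

for p in predictions:
    print(f"{p.name}: predicted {p.comparator} {p.threshold}, "
          f"observed {p.observed}, {p.verdict()}")
```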

Surprise 5: aggressive stopping has a real cost on retrieval

On the iterative-RAG cell, LoopGain cut cost 98.1% — the biggest single-cell saving in the bench. It also retrieved the gold document 4.5 percentage points less often than the run-to-cap baseline (85.0% vs 89.5% hit@5). LoopGain’s early stop occasionally cuts off a retrieval loop that would have found the answer with a few more revision attempts.
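For readers unfamiliar with the metric, hit@5 counts a query as a success when the gold document appears among the top five retrieved results. A minimal sketch of that computation follows. The function names and the toy queries are made up for illustration and are not taken from the benchmark harness.

```python
from typing import Sequence, Tuple


def hit_at_k(ranked_ids: Sequence[str], gold_id: str, k: int = 5) -> bool:
    """True when the gold document is among the top-k retrieved ids."""
    return gold_id in ranked_ids[:k]


def hit_rate(results: Sequence[Tuple[Sequence[str], str]], k: int = 5) -> float:
    """Fraction of queries whose gold document lands in the top k.

    `results` is a sequence of (ranked_ids, gold_id) pairs, one per query.
    """
    if not results:
        raise ValueError("results must not be empty")
    return sum(hit_at_k(ranked, gold, k) for ranked, gold in results) / len(results)


# Toy data: the second query's gold document is ranked sixth, so it misses at k=5.
toy_results = [
    (["d3", "d1", "d9"], "d1"),
    (["d2", "d4", "d6", "d8", "d0", "d7"], "d7"),
]
print(f"hit@5 = {hit_rate(toy_results):.1%}")

# The gap reported above, in percentage points.
gap_pp = 89.5 - 85.0
print(f"gap = {gap_pp:.1f} pp")
```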

That’s a genuine trade-off, not a rounding error: on retrieval workloads you can buy back some of that 4.5 pp by making LoopGain’s stopping more conservative. We report the default-config result because shipping the number that flatters you and hiding the tuning knob is how benchmarks lose trust.

Why pre-register a benchmark for your own product?

Because a benchmark’s job isn’t to produce the cleanest number — it’s to produce the truest one, and the only way to keep yourself honest is to write the predictions down before you can see the data. We locked the protocol, the hypotheses, and six kill criteria before any real cell ran. Zero of the six fired — but one predicted floor was missed and another landed right at the line, and you just read about both, because the registration made hiding them impossible.

Everything is public: the protocol with its amendments, the full raw trial data (every iteration of all 8,000 runs), the per-cell tables, and the live dashboard populated with all 2,000 trials. Reproduce it, argue with it, or run it on your own workloads — the harness is in the repo.

The headline is 92.8% cheaper. The reason to believe it is everything above that complicates it.

LoopGain is `pip install loopgain`, Apache-2.0. The benchmark, protocol, and raw data are open; the numbers here are the registered results from loopgain 0.4.0.
