bench · locked 2026-05-25

LoopGain Benchmark

2,000 paired real-API trials. 10 cells across 6 framework adapters. Methodology pre-registered and locked 2026-05-21 — before any of the confirmatory data was collected.

Re-validated on loopgain v0.4.0 (carried unchanged into the current 0.4.1 release); results landed 2026-06-03. The v0.4.0 classifier corrects a trajectory-label bug — a stuck loop now reads as STALLING, not OSCILLATING — and the corrected verdict terminates one iteration later, so the cost headline moved 93.5%→92.8%. Full re-validation note in the bench RESULTS.md.
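To make the STALLING versus OSCILLATING distinction concrete, here is a toy sketch of how a flat error trajectory might be told apart from one that swings back and forth. It is not LoopGain's classifier: the function name `toy_trajectory_label`, the `eps` tolerance, and the `UNKNOWN` and `OTHER` labels are all assumptions made for illustration.

```python
from typing import Sequence


def toy_trajectory_label(errors: Sequence[float], eps: float = 1e-3) -> str:
    """Label a per-iteration error trajectory (hypothetical rule, not LoopGain's)."""
    # Change in error from each iteration to the next.
    deltas = [b - a for a, b in zip(errors, errors[1:])]
    if not deltas:
        return "UNKNOWN"  # fewer than two iterations: nothing to compare

    # A stuck loop barely moves: every step is within the tolerance.
    if all(abs(d) <= eps for d in deltas):
        return "STALLING"

    # An oscillating loop reverses direction on every consecutive step.
    flips = sum(1 for d1, d2 in zip(deltas, deltas[1:]) if d1 * d2 < 0)
    if len(deltas) >= 2 and flips == len(deltas) - 1:
        return "OSCILLATING"

    return "OTHER"


print(toy_trajectory_label([0.5, 0.5, 0.5, 0.5]))  # STALLING
print(toy_trajectory_label([0.5, 0.2, 0.5, 0.2]))  # OSCILLATING
```

Under this toy rule a stuck loop can never be labeled OSCILLATING, because the flat check runs first.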

hero_seed34.png · per-iteration error trajectory · max_iter=20 vs LoopGain · single representative trial from the bench

On a single representative trial in the bench, max_iter=20 first found the correct answer at iteration 3, kept iterating, and degraded back to broken code by iteration 20. LoopGain detected TARGET_MET at iteration 2 and stopped with the working code. When the model finds the right answer early, naive max_iter=N can iterate past success and degrade the output.
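The failure mode can be sketched with a toy example. The error values below are made up for illustration and are not bench data, and the early-stop rule is a plain threshold check, not LoopGain's detection logic; `run_fixed_budget` and `run_with_early_stop` are hypothetical helper names.

```python
from typing import Sequence


def run_fixed_budget(errors: Sequence[float], max_iter: int) -> int:
    """Return the 1-based iteration whose output a fixed-budget loop keeps."""
    # A fixed budget always runs to the end and keeps the last output.
    return min(max_iter, len(errors))


def run_with_early_stop(errors: Sequence[float], target: float) -> int:
    """Return the first 1-based iteration whose error is at or below target."""
    for iteration, err in enumerate(errors, start=1):
        if err <= target:
            return iteration
    return len(errors)  # target never met: fall back to the last iteration


# Made-up trajectory: error reaches 0 at iteration 3, then degrades again.
toy_errors = [0.6, 0.3, 0.0, 0.0, 0.1, 0.4, 0.7]

fixed_stop = run_fixed_budget(toy_errors, max_iter=7)
early_stop = run_with_early_stop(toy_errors, target=0.0)

print(fixed_stop, toy_errors[fixed_stop - 1])  # 7 0.7
print(early_stop, toy_errors[early_stop - 1])  # 3 0.0
```

In this toy run the fixed budget pays for seven iterations and ends on a worse output than the one it had at iteration 3.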

The numbers

Across the full registered run (10 cells × n=200 paired trials × 4 conditions = 8,000 loop runs, plus 1,800 pairwise judge comparisons):

                                 max_iter=5   max_iter=10   max_iter=20   LoopGain
Total API spend                  $6.83        $13.65        $27.05        $1.94
Median wall-clock per trial      7.2s         14.8s         30.9s         2.1s
Implied savings vs max_iter=20   -            -             -             92.8% cost / 93.3% time

Absolute wall-clock time depends on environment and concurrency and isn't a headline metric: this re-run used a less-contended machine than the 0.2.0 run, so the seconds are lower. The stable claim is the ~15× ratio between max_iter=20 and LoopGain; cost ratios are stable to ~1 pp run-to-run.
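The implied savings can be rechecked from the published figures. The sketch below uses only the rounded spend and median values shown above; the variable and function names are illustrative. From the rounded medians the time saving comes out at 93.2%, a tenth of a point below the published 93.3%, which would be consistent with the published figure having been computed from unrounded medians (an assumption, not something stated in the writeup).

```python
# Rounded figures as published on this page.
spend_usd = {"max_iter=5": 6.83, "max_iter=10": 13.65, "max_iter=20": 27.05, "LoopGain": 1.94}
median_seconds = {"max_iter=5": 7.2, "max_iter=10": 14.8, "max_iter=20": 30.9, "LoopGain": 2.1}


def savings_pct(baseline: float, value: float) -> float:
    """Percentage saved by `value` relative to `baseline`."""
    return 100.0 * (1.0 - value / baseline)


cost_savings = savings_pct(spend_usd["max_iter=20"], spend_usd["LoopGain"])
time_savings = savings_pct(median_seconds["max_iter=20"], median_seconds["LoopGain"])
time_ratio = median_seconds["max_iter=20"] / median_seconds["LoopGain"]

print(f"cost savings: {cost_savings:.1f}%")  # cost savings: 92.8%
print(f"time savings: {time_savings:.1f}%")  # time savings: 93.2%
print(f"time ratio:   {time_ratio:.1f}x")    # time ratio:   14.7x
```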

cost_by_condition.png · total API spend across the bench, by condition

See it live

Open the bench data in the LoopGain dashboard →

The bench tenant's 2,000 trials are visible in the actual product dashboard — the same UI a customer would see, populated with the canonical benchmark data. Read-only public view; sign up free to instrument your own loops.

Honest disclosures

One pre-registered floor was missed without firing a kill criterion. It's surfaced in the writeup:

Under the original 0.2.0 run, H-EARLYWARN missed its floor with a median lead time of 2 iterations, and the widest parity spread was W1 at 5.8 pp. The corrected 0.4.0 classifier flags STALLING earlier, so median lead time now meets the ≥3-iteration floor, and the widest spread moved to W2 at 5.5 pp.

Seven pre-data amendments to the methodology are preserved in BENCH_PROTOCOL.md with their full rationale. Predicted floors and kill criteria were never changed once data started landing.

The bench harness itself had two non-trivial bugs that were caught and fixed during the run (signal handling under concurrency; thread-pool shutdown semantics). The forensics for both are documented in LESSONS.md. The data here is from the post-fix, n=200, tripwire-clean run.

Reproduce it yourself

The bench, its methodology, the raw data (9.4 MB of JSONL across 10 cells + 9 judge runs), all six analysis charts, the engineering forensics, and all seven pre-data amendments are public:

github.com/loopgain-ai/loopgain-bench

$ git clone https://github.com/loopgain-ai/loopgain-bench
$ cd loopgain-bench
$ make install-dev
$ make bench     # ~$50, ~4-8h on a single Mac
$ make judge     # ~$1-2
$ make analyze   # six tables + six charts
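Before running the full analysis, it can help to sanity-check the raw JSONL. The sketch below counts records per file; it assumes only that the data is stored as `*.jsonl` files with one JSON object per line. The directory layout and record schema are not described here, so the helper name `count_jsonl_records` and any path passed to it are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Union


def count_jsonl_records(data_dir: Union[str, Path]) -> Dict[str, int]:
    """Map each *.jsonl file under data_dir to its number of JSON records."""
    root = Path(data_dir)
    counts: Dict[str, int] = {}
    for path in sorted(root.rglob("*.jsonl")):
        n = 0
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue  # tolerate blank lines
                json.loads(line)  # raises ValueError on a malformed record
                n += 1
        counts[path.relative_to(root).as_posix()] = n
    return counts
```

Calling `count_jsonl_records("path/to/raw/data")` from the repository root (with the real data path substituted) returns a dict that can be compared against the expected n=200 trials per cell.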

Try LoopGain

pip install loopgain — or pip install 'loopgain[<your framework>]' for adapter extras (LangGraph, CrewAI, AutoGen, LangChain, OpenAI Agents SDK, Claude Agent SDK).

$ pip install loopgain

3 design-partner slots open for the next 30 days. Free 30-day pilot, direct founder support. Email hello@loopgain.ai.
