Benchmark

The loop-engineering wave is real, and so is the spend. We ran 90 real fix-until-green agent loops with LoopGain watching the failing-test count. The convergence math held perfectly at session scale, with zero incoherent classifications in 90 loops, and LoopGain stopped cleanly when sessions had budget (30/30). But the stop rule has an honest bug: on a hard, budget-tight cell it false-stopped 13 of 30 times, 9 of them one session before the fix would have landed. We traced it to a hardcoded rule in our own core, measured the savings honestly (big vs. a naive loop, modest vs. a smart one), and caught ourselves nearly publishing a third ‘finding’ that was really a test artifact. Here’s all of it.

Benchmark

We instrumented 90 'fix until green' agent loops. Here's what they waste.

We ran 2,000 paired agent-loop trials. Here's what surprised us.
