How to design a strong verifier for AI agent loops

A verify-revise loop is a deal between two parts. One part proposes; the other part judges. The loop runs until the judge is satisfied. LoopGain’s whole job is to read that judgment over time and decide the moment the loop has actually converged — instead of grinding to a fixed max_iterations cap and burning tokens on a settled answer.

But there is a quiet assumption underneath all of it: that the judge is right. LoopGain acts on the error signal you give it. If your verifier reports zero errors, LoopGain trusts that and stops. It controls when the loop stops, not whether your verifier was correct to say so.

So the most important component in an agent loop is the one people spend the least time on: the verifier. This is a field guide to building one strong enough to trust at the stop.

“Error = 0” means zero detected errors

Start with the number that motivated this post.

Our public benchmark runs real verify-revise loops across six frameworks. On the code-generation workload, the loop writes a Python function and a test suite grades it; the error is the count of failing tests, and the loop stops when that hits zero. Standard setup.

Here’s the catch. During the loop, the grader runs only a sample of the available tests — the base cases plus eight sampled edge cases, around fifteen in all. The full suite for those same problems has around 110. So we did something simple: we took every run that converged — every run where the loop confidently declared “zero errors, done” — and re-graded its output against the full ~110-case oracle it never saw.

About 1 in 20 — 4.5% — had a bug the loop’s checks never caught. The code passed all of its in-loop tests, the error read zero, LoopGain stopped, and the answer was still wrong on an edge case nobody sampled.

That 4.5% is not LoopGain misbehaving. LoopGain read the trajectory correctly: the error genuinely went to zero and stayed there. The loop converged to a wrong fixed point, and it did so because the verifier — not the loop — had a blind spot. Error = 0 means “zero errors I can see,” not “correct.”

And 4.5% is a floor, not a ceiling. That in-loop sample — roughly 15 assertions, eight of them drawn from the edge-case pool — is a fairly strong proxy. Most real loops are graded by far less — a couple of hand-written assertions, a JSON-schema check, or another model asked “does this look right?” The weaker your verifier, the more wrong answers wear a clean error of zero.
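The re-grading step above is simple enough to sketch. This is a minimal illustration, not the benchmark's actual harness: `count_failures`, `in_loop_cases`, and `full_cases` are hypothetical names, and the buggy `dedupe` stands in for a converged output.

```python
from typing import Callable, Iterable, Tuple

Case = Tuple[tuple, object]  # (positional args, expected result)

def count_failures(fn: Callable, cases: Iterable[Case]) -> int:
    """Error signal: number of cases where fn raises or returns the wrong value."""
    failures = 0
    for args, expected in cases:
        try:
            if fn(*args) != expected:
                failures += 1
        except Exception:
            failures += 1
    return failures

# Hypothetical converged output: passes the sample, but sorts away input order.
def dedupe(xs):
    return sorted(set(xs))

# The sample the loop iterated against (assumed, for illustration).
in_loop_cases = [(([1, 2, 3],), [1, 2, 3]), (([1, 1, 1],), [1])]
# The fuller oracle adds an edge case the sample never drew.
full_cases = in_loop_cases + [(([3, 1, 3, 2, 1],), [3, 1, 2])]

in_loop_error = count_failures(dedupe, in_loop_cases)  # 0: what the loop saw
full_error = count_failures(dedupe, full_cases)        # 1: what the oracle sees
print(in_loop_error, full_error)
```

Same function, same grader, different case list. The only thing that changed between "zero errors" and "one error" is how much the verifier was allowed to look at.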

The two ways a verifier fails you

Verifiers fail in two distinct directions, and they call for different fixes.

Blind spots — the verifier says “good” when the answer is bad. Incomplete coverage. The ~15-of-110 case above. This is the dangerous one, because nothing looks wrong: the error drops to zero, the loop converges cleanly, and there is no signal — not in LoopGain, not anywhere — that the stop was premature. You only find out in production.

Noise — the verifier’s score wobbles even when quality doesn’t. A stochastic grader (an LLM judge at temperature, a flaky integration test) returns a different number on the same output. This shows up as a loop that won’t settle, or one that stops early on a lucky-good reading. It is annoying, but at least it is visible — the trajectory is jumpy.

A strong verifier is mostly a sustained effort to shrink the first failure mode without amplifying the second.
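Noise, at least, can be measured directly: grade the same output several times and look at the spread. The sketch below assumes a hypothetical `noisy_judge` standing in for a stochastic grader. The jitter range and repeat count are illustrative, not recommendations.

```python
import random
import statistics

def noisy_judge(output: str, rng: random.Random) -> float:
    """Hypothetical stochastic grader: a fixed true score of 0.7 plus jitter."""
    return 0.7 + rng.uniform(-0.1, 0.1)

def score_spread(judge, output, rng, repeats=5):
    """Grade one unchanged output several times; return (median, max - min)."""
    scores = [judge(output, rng) for _ in range(repeats)]
    return statistics.median(scores), max(scores) - min(scores)

rng = random.Random(0)  # seeded so the sketch is repeatable
median, spread = score_spread(noisy_judge, "same output every time", rng)
print(f"median={median:.3f} spread={spread:.3f}")
```

If the spread on an unchanged output is as large as the change you see between iterations, the trajectory is mostly telling you about the grader. Feeding the loop the median of a few readings is one plausible way to damp that, at the cost of extra grader calls.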

Five principles for a verifier you can trust at the stop

1. Prefer an oracle to an opinion

An oracle answers “is this correct?” deterministically: a test suite that executes the code, a schema that validates the structure, a type checker, an exact-match against a known-good result. An opinion estimates it: an LLM grader, a heuristic score, a similarity threshold.

Use an oracle whenever one exists. It is cheaper (no model call), it is stable (no temperature), and its “zero errors” actually means zero. Reach for an opinion only when the task genuinely has no oracle — open-ended generation, summarization, judgment calls — and then treat its score as an estimate, not a verdict.

The common mistake is using an opinion where an oracle was available: grading generated SQL by asking a model “is this query correct?” when you could have run it against a fixture database and compared rows.
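Here is roughly what the oracle version of that SQL check could look like, using Python's built-in `sqlite3`. The `orders` table, the queries, and the `rows_match` helper are all made up for illustration. Comparing rows as a multiset assumes the task does not require a particular row order.

```python
import sqlite3
from collections import Counter

def rows_match(conn, candidate_sql: str, reference_sql: str) -> bool:
    """Oracle: run both queries on the fixture and compare rows as multisets."""
    try:
        got = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that does not run is wrong
    want = conn.execute(reference_sql).fetchall()
    return Counter(got) == Counter(want)

# Hypothetical fixture: a tiny in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "ada", 30.0), (2, "bob", 5.0), (3, "ada", 12.5)],
)

reference = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
good = reference + " ORDER BY customer DESC"  # same rows, different order
bad = "SELECT customer, MAX(total) FROM orders GROUP BY customer"

ok_good = rows_match(conn, good, reference)  # True
ok_bad = rows_match(conn, bad, reference)    # False: MAX is not SUM
print(ok_good, ok_bad)
```

The `bad` query reads plausibly and would likely survive a "does this look right?" check. The fixture catches it because the rows differ.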

2. Cover the edges, not just the happy path

Most verifiers test what the author was already thinking about. The bugs live where the author wasn’t.

Empty inputs. A single element. Duplicates. Negative numbers. Zero. The maximum. Unicode. The off-by-one at the boundary. Our 4.5% was almost entirely this: code that handled the obvious cases and broke on one edge the in-loop sample happened to skip.

# Weak: confirms the happy path the author had in mind.
assert dedupe([1, 2, 3]) == [1, 2, 3]

# Strong: probes where bugs actually hide.
assert dedupe([]) == []
assert dedupe([1, 1, 1]) == [1]
assert dedupe([3, 1, 3, 2, 1]) == [3, 1, 2]   # order preserved?
assert dedupe([0, False]) == [0]              # 0 == False trap

A verifier that only confirms the happy path will hand LoopGain a clean zero on code that is one edge case away from wrong.

3. Hold out the checks you judge by

If the loop can see the checks it will be graded on, it will optimize for those checks — and a model is very good at making a specific set of assertions pass without solving the general problem. This is Goodhart’s law in a loop: the moment a measure becomes the target, it stops measuring what you cared about.

Keep a separation. Let the loop iterate against one set of checks; judge the final result against a held-out set it never optimized against. Our benchmark accidentally demonstrated both sides of this: the loop optimized against its ~15-test sample, and the gap only became visible when we graded against the full ~110 it had never seen. If we had let the loop see all 110, the 4.5% would have vanished — not because the code got better, but because we would have stopped measuring the thing that mattered.
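The separation itself can be as small as one seeded shuffle. A minimal sketch, assuming your checks live in a single list. `split_checks` and the 50/50 fraction are hypothetical choices, not something the benchmark prescribes.

```python
import random

def split_checks(cases, in_loop_fraction=0.5, seed=0):
    """Shuffle once with a fixed seed; return (in_loop, held_out)."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * in_loop_fraction))
    return shuffled[:cut], shuffled[cut:]

all_cases = [f"case_{i}" for i in range(10)]  # placeholder check identifiers
in_loop, held_out = split_checks(all_cases)

# The loop iterates against in_loop only.
# The final answer is judged once against held_out.
print(len(in_loop), len(held_out))
```

Fixing the seed keeps the split stable across runs, so a check never drifts from the held-out side to the in-loop side between experiments.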

4. Make equality mean equality

A verifier is only as trustworthy as its notion of “correct,” and that notion hides in your assertions.

While measuring the 4.5%, we hit a function that returns combinations as a list of tuples. One solution returned the right combinations in a different order. Was it wrong? Our first, naive re-grader said yes — it compared with strict ==, which is order-sensitive. The answer was arguably correct; the assertion was too strict.

from collections import Counter

# Order-sensitive: flags a set-correct answer as wrong.
assert combinations(xs, 2) == EXPECTED

# Says what you actually mean: same tuples, same counts, any order.
assert Counter(combinations(xs, 2)) == Counter(EXPECTED)

The lesson cuts both ways. Too strict, and your verifier rejects correct answers, forcing reruns that eat the savings. Too loose, and it accepts wrong ones. Decide deliberately what equality means for your output — order, whitespace, float tolerance, set vs. sequence — and encode exactly that. A sloppy == is a blind spot wearing a confident face.
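The same idea applies to floats and whitespace. The helpers below are a sketch of what "encode exactly that" might look like. `floats_equal`, `text_equal`, and the tolerance values are assumptions to adapt, not defaults to copy.

```python
import math

def floats_equal(got, want, rel_tol=1e-9, abs_tol=1e-12):
    """Sequences of floats, order-sensitive, compared within a tolerance."""
    return len(got) == len(want) and all(
        math.isclose(g, w, rel_tol=rel_tol, abs_tol=abs_tol)
        for g, w in zip(got, want)
    )

def text_equal(got: str, want: str) -> bool:
    """Text compared after collapsing runs of whitespace."""
    return got.split() == want.split()

strict = [0.1 + 0.2] == [0.3]                # False: floating-point rounding
tolerant = floats_equal([0.1 + 0.2], [0.3])  # True: within tolerance
print(strict, tolerant, text_equal("a  b\n", "a b"))
```

Each helper states one decision out loud: order matters here, tiny rounding differences do not, spacing does not. If whitespace is significant in your output, `text_equal` is the wrong comparator, and that is exactly the kind of choice worth making on purpose.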

5. Make the error signal monotone

LoopGain reads the shape of your error over iterations — is it falling, stalling, diverging. That only works if the number means something. A good error signal goes down as quality goes up, smoothly, with no plateaus that hide progress and no jumps that aren’t real.

Counting failing tests behaves this way: fix one without breaking another, and the number drops by one. A binary pass/fail does not. It is one long plateau that tells you nothing until the very end, so the loop can’t tell “almost there” from “hopeless.” Where you can, give the loop a graded signal (how many checks fail, how far off the value is) rather than a single all-or-nothing bit. It makes both the loop and LoopGain’s read of it sharper.
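To see the difference, here are both signals computed over three hypothetical revisions of the same function. `graded_error`, `binary_error`, and the revisions `v1` to `v3` are illustrative names, not part of LoopGain.

```python
def graded_error(fn, checks):
    """Number of failing checks; a check that raises counts as failing."""
    failing = 0
    for check in checks:
        try:
            ok = check(fn)
        except Exception:
            ok = False
        failing += 0 if ok else 1
    return failing

def binary_error(fn, checks):
    """All-or-nothing: 1 until every check passes."""
    return 0 if graded_error(fn, checks) == 0 else 1

checks = [
    lambda f: f([]) == [],
    lambda f: f([1, 1, 1]) == [1],
    lambda f: f([3, 1, 3, 2, 1]) == [3, 1, 2],
]

# Three hypothetical revisions the loop might produce in sequence.
def v1(xs): return xs                       # no dedup at all
def v2(xs): return sorted(set(xs))          # dedups, loses order
def v3(xs): return list(dict.fromkeys(xs))  # dedups, keeps order

graded = [graded_error(v, checks) for v in (v1, v2, v3)]  # [2, 1, 0]
binary = [binary_error(v, checks) for v in (v1, v2, v3)]  # [1, 1, 0]
print(graded, binary)
```

The graded trajectory shows steady progress from the first revision. The binary one looks stalled until the last step, which is indistinguishable from a loop that is going nowhere.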

When there is no oracle

Some tasks genuinely can’t be graded deterministically. For those, two patterns help.

An LLM judge, done right. Give it an explicit rubric rather than “is this good?”. Use a different model than the one in the loop, so you are not asking a model to grade its own blind spots. Calibrate it against a handful of human-labeled examples before you trust it. And remember it is an opinion — treat its score as noisy, not final.
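Calibration can start very small: run the judge over a few human-labeled outputs and count where it disagrees. In the sketch below, `stub_judge` is a deliberately crude stand-in for a real rubric-based model call, and the labeled examples are invented.

```python
def agreement_rate(judge, labeled_examples):
    """Fraction of human-labeled examples where the judge agrees with the human."""
    hits = sum(1 for output, human_ok in labeled_examples if judge(output) == human_ok)
    return hits / len(labeled_examples)

def false_accepts(judge, labeled_examples):
    """Outputs the judge passed but a human rejected: the blind spots."""
    return [out for out, human_ok in labeled_examples if judge(out) and not human_ok]

# Hypothetical stand-in for an LLM judge; a real one would call a model.
def stub_judge(output: str) -> bool:
    return "TODO" not in output and len(output) > 10

labeled = [
    ("A complete, sourced summary.", True),
    ("TODO: write summary", False),
    ("Too short", False),
    ("Fluent but factually wrong text.", False),
]

rate = agreement_rate(stub_judge, labeled)    # 0.75
missed = false_accepts(stub_judge, labeled)   # the fluent-but-wrong example
print(rate, missed)
```

The false accepts matter more than the headline agreement rate. They are the cases where the loop would have stopped on a clean score and shipped a wrong answer.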

A second opinion at the stop. The expensive, thorough check doesn’t have to run every iteration — most iterations aren’t the last one. Run your cheap signal throughout, and reserve a stronger check for the single moment the loop is about to stop. One careful look at the finish line, not a careful look at every step. That keeps the cost where it belongs and catches the confident-wrong stop right before it ships.
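A bare-bones version of that pattern, without LoopGain in the picture: the stop decision is simplified to "cheap error is zero," and every name here (`run_loop`, `propose`, `cheap_error`, `strong_error`) is hypothetical. Resuming the loop after a failed strong check is one possible policy, chosen here as an assumption. A stricter hold-out would flag the run instead of feeding that result back.

```python
def run_loop(propose, cheap_error, strong_error, max_iterations=10):
    """Iterate on the cheap signal; confirm with the strong check before stopping."""
    candidate = propose(None, feedback=None)
    for _ in range(max_iterations):
        if cheap_error(candidate) == 0:
            remaining = strong_error(candidate)  # runs only at a would-be stop
            if remaining == 0:
                return candidate, True
            feedback = f"strong check found {remaining} issue(s)"
        else:
            feedback = "cheap check failed"
        candidate = propose(candidate, feedback=feedback)
    return candidate, False  # cap reached without a confirmed stop

# Toy stand-ins: successive revisions are just the numbers 1, 2, 3.
revisions = iter([1, 2, 3])
def propose(previous, feedback):
    return next(revisions)

strong_calls = []
def cheap_error(c):
    return 0 if c >= 2 else 1
def strong_error(c):
    strong_calls.append(c)
    return 0 if c >= 3 else 1

result, confirmed = run_loop(propose, cheap_error, strong_error)
print(result, confirmed, strong_calls)  # 3 True [2, 3]
```

The strong check ran twice across three candidates: once to reject a confident-wrong stop at revision 2, once to confirm revision 3. It never ran on the iteration the cheap check had already failed.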

A checklist

Before you trust a loop’s “done,” ask:

Is there an oracle for this task, and am I using it instead of an opinion?
Do my checks cover edge cases, or just the happy path?
Are the checks I judge by held out from the ones the loop optimizes against?
Does my notion of equality match what “correct” actually means here?
Is the error signal monotone — does it fall as quality rises?
If there’s no oracle, is my judge rubric-based, cross-model, and calibrated?
Am I checking result.best_error (or my own pass/fail) before I trust the output?

The honest framing

LoopGain controls when a loop stops. You own the definition of done. Those are different jobs, and the second one is yours.

That is not a limitation we are hiding — it is the contract. A monitor that stops a loop the instant it converges is a real, measurable saving, but it inherits the judgment of whatever verifier you hand it. This isn’t a quirk of our monitor; it is what any feedback loop does — it drives hard to the target its feedback path defines. Hand it a verifier with a blind spot and it converges, confidently, onto that blind spot. Pair it with a strong one and you get cheap loops that stop at the right answer.

So spend the time on the verifier. It is the part of the loop that decides what “correct” means, and everything else — including us — just trusts it.

LoopGain is an open-source Barkhausen-stability monitor for AI agent loops. The benchmark, the 4.5% measurement, and the verification harness behind it are all on GitHub. If you’re building verify-revise loops and want to compare notes on verifier design, the repo is the place — issues and discussions welcome.
