[{"content":" Stopping an AI agent loop is a signal-processing problem. Most teams treat it as a prompt-engineering one — tune the instructions, set max_iterations=5, hope — and that\u0026rsquo;s exactly why it doesn\u0026rsquo;t work. You can\u0026rsquo;t fix a control problem with a better prompt.\nHere\u0026rsquo;s the control problem. Your agent runs a verify-revise loop. Each iteration produces an output and some measure of how wrong it still is. That measure, over iterations, is a time series — a noisy signal. \u0026ldquo;Should the loop keep going?\u0026rdquo; is a question about that signal: is the error trending down, flat, or blowing up, and is that trend real or just noise? max_iterations answers a different question entirely — \u0026ldquo;have we done this N times yet?\u0026rdquo; — which is why it stops good loops early and lets bad ones run to the cap.\nThis post is how LoopGain answers the real question. It\u0026rsquo;s an open-source (Apache-2.0) library, so none of this is a black box — but the math is worth understanding even if you never read the source.\nOne naming note, since they look alike: loop gain (lowercase) is the control-theory quantity this post is about; LoopGain (the library) is named after it. The metric is the interesting part — LoopGain just computes it and acts on it.\nThe one number you have to provide: the error signal LoopGain doesn\u0026rsquo;t define correctness for you. Each iteration, you hand it a single number — the error signal — that says how wrong the current output is:\nfailing unit tests (count of failures), schema violations (count), an LLM-judge\u0026rsquo;s distance from the target, retrieval miss count, lint errors, whatever \u0026ldquo;wrong\u0026rdquo; means for your loop. state = lg.observe(errors, output) # errors: a number, or a list whose length is the magnitude Everything below operates on the sequence of those numbers. 
If you can\u0026rsquo;t produce an error signal, LoopGain has nothing to read. That\u0026rsquo;s the one hard requirement, and we\u0026rsquo;ll come back to it.\nLoop gain: the per-step ratio, and why it\u0026rsquo;s not enough alone The namesake quantity is loop gain, written $A\\beta$, from the Barkhausen stability criterion in control engineering. It\u0026rsquo;s the factor by which the error changes from one iteration to the next:\n$$A\\beta = \\frac{E_n}{E_{n-1}}$$\n$A\\beta \u0026lt; 1$ means the error shrank: the loop is improving. $A\\beta \\geq 1$ means it held or grew: the loop is stuck or making things worse. That\u0026rsquo;s the whole intuition, and if loops were noiseless you could stop right there: watch $A\\beta$, stop when it crosses 1.\nLoops are not noiseless. A single $\\frac{E_n}{E_{n-1}}$ ratio bounces around: one lucky iteration drops the error, the next recovers it, and the instantaneous ratio swings wildly even when the underlying trend is a clean decline. Act on one ratio and you\u0026rsquo;ll stop on noise. So LoopGain doesn\u0026rsquo;t act on one ratio. It reads the whole trajectory.\nStep 1: work in log space Barkhausen says $E_n = A\\beta \\cdot E_{n-1}$, so the error decays (or grows) geometrically. Take the log of both sides and a geometric trend becomes a straight line:\n$$\\log E_n = \\log E_0 + n \\log(A\\beta)$$\nSo LoopGain transforms the error history to $\\log_{10}(E)$. Now \u0026ldquo;is this loop converging?\u0026rdquo; becomes \u0026ldquo;does this line slope down?\u0026rdquo;, and a slope is something you can fit and test instead of eyeball.\nStep 2: fit the trend, then test whether it\u0026rsquo;s real LoopGain fits an ordinary least-squares line to $\\log_{10}(E)$ versus iteration. 
The slope of that line is the geometric-average $\\log A\\beta$ across the whole loop, a far more stable estimate than the last single ratio.\nBut a downward slope on five noisy points might be chance. So LoopGain runs a two-sided t-test on the slope and gets a p-value. A trend only counts as real when $p \u0026lt; 0.05$, the same significance bar you\u0026rsquo;d use anywhere else. This is the difference between \u0026ldquo;the error happened to drop\u0026rdquo; and \u0026ldquo;the error is significantly decreasing.\u0026rdquo; It\u0026rsquo;s pure stdlib: a closed-form OLS slope and a Student-t p-value, no SciPy, no model.\nStep 3: measure the wobble A loop can have a flat trend and still be useless, bouncing between two answers and never settling. LoopGain detrends the log-error (subtracts the fitted line) and takes the standard deviation of the residuals. High residual scatter with a flat slope is the signature of oscillation: lots of motion, no progress.\nThe decision: five named states From three features (the cumulative reduction $E_{\\text{current}}/E_{\\text{first}}$, the fitted slope and its significance, and the oscillation std), LoopGain classifies the loop\u0026rsquo;s current state. The rule, in plain form:\nState | Condition (informally) | Action\nTARGET_MET | error hit your target (e.g. zero failing tests) | stop, keep the answer\nFAST_CONVERGE | error collapsed ~10× from the start and still dropping | continue, predict ETA\nCONVERGING | slope significantly down (or a solid cumulative drop) | continue\nSTALLING | progress has flattened, with no new low for a few steps | stop once it persists, keep best-so-far\nOSCILLATING | high wobble around a flat trend | stop, roll back\nDIVERGING | slope significantly up, past a margin | stop, roll back\nThe \u0026ldquo;stop, we\u0026rsquo;re done\u0026rdquo; case is TARGET_MET, a short-circuit that fires the moment your error signal hits its target, before any trajectory math runs. 
FAST_CONVERGE is the opposite of done: the error has dropped a lot and is still dropping, so the loop keeps going (and LoopGain predicts the ETA to target). A loop only stops on a healthy trajectory when it actually reaches the target; it stops early when the trajectory turns bad (OSCILLATING / DIVERGING) or goes flat (STALLING).\nThe thresholds aren\u0026rsquo;t tuned to make a demo look good. They\u0026rsquo;re pre-registered and derived from convention: a one-decade (90%) reduction is the textbook step-response settling criterion; $p \u0026lt; 0.05$ is standard significance; the oscillation cutoff corresponds to roughly a ±2× ripple, an underdamped $Q \\approx 3$ response. You can override them, but the defaults come from control theory and statistics, not from fitting the benchmark.\nTwo things the trajectory buys you for free Once you\u0026rsquo;re fitting a log-trend, two useful quantities fall out.\nETA. If the loop is converging toward a target error $E_{\\text{target}}$, the number of iterations remaining is a closed-form Barkhausen prediction:\n$$n_{\\text{remaining}} = \\frac{\\log(E_{\\text{target}} / E_{\\text{current}})}{\\log(A\\beta_{\\text{smooth}})}$$\nSo LoopGain can tell you \u0026ldquo;about 3 more iterations\u0026rdquo; instead of just \u0026ldquo;still going.\u0026rdquo; (It returns nothing when the prediction is undefined, meaning no target or a non-converging gain, rather than guessing.)\nGain margin. Borrowed straight from control engineering: $\\text{GM} = 1 / \\max(A\\beta_{\\text{smooth}})$. Greater than 1 means the loop never crossed into instability; the larger, the more headroom it had. 
It\u0026rsquo;s a one-number summary of how close the whole loop came to blowing up.\nBest-so-far rollback When the loop stops on OSCILLATING or DIVERGING, the last output is, by definition, not the best one; it\u0026rsquo;s the degraded one. So LoopGain keeps the output associated with the lowest error it ever saw and hands that back:\nanswer = lg.result.best_output  # the iteration that worked, not the last one\nThis is the part that turns \u0026ldquo;stop early to save money\u0026rdquo; into \u0026ldquo;stop early and return a better answer,\u0026rdquo; because on a diverging loop the last iteration is often worse than one you passed three steps ago.\nWhere this doesn\u0026rsquo;t work, honestly The math has hard edges, and you should know them before you reach for it:\nSingle-pass agents get nothing. One model call, no revise step, no trajectory. There\u0026rsquo;s nothing to fit.\nNo error signal, no LoopGain. If you can\u0026rsquo;t put a number on how wrong an output is, none of the above runs. The quality of the control is bounded by the quality of your error signal.\nIt needs a few iterations. With one or two points the slope has no degrees of freedom to test, so the significance machinery can\u0026rsquo;t engage and the cap is doing the real work.\nGradual convergence is the rare trajectory in practice. In our 2,000-trial benchmark, the CONVERGING and OSCILLATING states fired on well under 1% of iterations: modern LLMs tend to one-shot or stall, not glide down over many steps or thrash between answers. The states LoopGain earns its keep on are the ones that quietly burn budget: STALLING (the loop pinned, going nowhere, on 21% of iterations) and DIVERGING, caught before the loop walks past its best answer.\nNone of this is hidden in the library. It\u0026rsquo;s a few hundred lines of stdlib Python under Apache-2.0: read the classifier, disagree with a threshold, open an issue. The point isn\u0026rsquo;t that loop gain is magic. 
It\u0026rsquo;s that \u0026ldquo;when is this loop done?\u0026rdquo; has an answer grounded in a century of control theory, and max_iterations isn\u0026rsquo;t it.\nLoopGain is pip install loopgain, Apache-2.0. The math here lives in the loopgain.classifier module; the benchmark behind the \u0026ldquo;rare trajectory\u0026rdquo; claim is open.\n","permalink":"https://loopgain.ai/blog/posts/how-loop-gain-works/","summary":"max_iterations is a guess because it ignores the one thing that tells you whether to stop: the trajectory of the error. Here\u0026rsquo;s how LoopGain reads that trajectory — loop gain, a log-space trend fit, and a t-test — to decide an agent loop is done.","title":"How loop gain works: knowing when an AI agent loop has converged"},{"content":" An AI agent in a verify-revise loop will keep going until something stops it. The question every team eventually asks is: what should that something be?\nThe default answer is max_iterations. You pick a number — usually 5, sometimes 3, occasionally 10 — and the loop runs until it either succeeds or hits the cap. It is the most-shipped control mechanism in agentic AI, and it is a guess. The loop has no idea whether iteration 5 is one step from done or three steps past the point where it started making the answer worse.\nWhen the guess costs money — and at current token prices, a runaway verify-revise loop costs real money — teams reach for tooling. This post is an honest map of what is out there. Most of it is open-source; one widely-used incumbent is not. But the license is not the line that matters. The line that matters is this: does the tool observe the loop, or does it control it?\nThe two questions a loop tool can answer There are two different questions, and most tools answer only the first:\n\u0026ldquo;What did my agent just do?\u0026rdquo; — tracing, token accounting, latency, the full transcript of each step, after the fact. This is observability. It is genuinely useful and most teams need it. 
\u0026ldquo;Should this loop keep going right now?\u0026rdquo; — a decision, made during the loop, that can stop it, roll it back, or let it continue. This is control. Observability is a dashboard you read after the run. Control is a function that runs inside the run. A team can want both, but they are not the same product, and conflating them is the most common mistake in evaluating this space.\nHere is the landscape on that axis.\nThe observability layer LangSmith LangSmith is the observability and evaluation platform from the team behind LangChain. It does tracing, monitoring, and offline evaluation, and it is tightly integrated with LangChain and LangGraph.\nTwo things to be precise about, because they get muddled: LangSmith is proprietary — the LangChain frameworks are open-source, but LangSmith itself is a commercial product, cloud-first, with self-hosting available only on the Enterprise plan. And its job is to record and evaluate what your agents did, not to intervene mid-run.\nWhere it sits on the line: observe. LangSmith gives you deep visibility into each step of the loop and lets you score quality after the fact. It does not decide, while the loop is running, whether the loop should still be running.\nLangfuse Langfuse is an open-source (MIT for the core; enterprise extras live in a separate /ee folder) LLM engineering platform — tracing, metrics, evals, prompt management, datasets — built on OpenTelemetry and self-hostable as a Docker container with a Postgres backend.\nWhere it sits on the line: observe, and notably open-source and self-hostable, which matters for teams with data-sovereignty requirements. Langfuse and LoopGain are complementary, not competitive: Langfuse records the loop; LoopGain decides whether it should still be running. (More on the one place this gets subtle — evals — below.)\nHelicone Helicone is an open-source (Apache-2.0) AI gateway and observability platform. 
Its signature is the integration model: change one URL and your LLM calls route through Helicone, which logs the full request and response, tracks tokens and cost, and can apply caching, rate-limiting, routing across providers, and custom metadata before forwarding the call.\nWhere it sits on the line: observe, with a gateway twist. Because Helicone sits in the request path, it can gate individual calls: rate-limit them, cache them, route them. But gating a single request is not the same as deciding that a verify-revise loop has converged. A proxy sees one call at a time; it does not see the convergence trajectory across iterations, which is the signal a loop controller acts on.\nThe baseline that ships in every codebase max_iterations (and its cousin, the action-hash check) No tool required. Two patterns cover almost every loop in production today. The cap (LangGraph spells it recursion_limit, most hand-rolled loops spell it max_iterations, but it is the same idea):\nfor i in range(max_iterations):  # the cap\n    result = agent.step()\n    if result.done:\n        break\nand the slightly smarter \u0026ldquo;stop if nothing changed\u0026rdquo;:\nseen = set()\nwhile True:\n    result = agent.step()\n    h = hash(result.action)  # the action-hash check\n    if h in seen:  # exact repeat → bail\n        break\n    seen.add(h)\nThese are not strawmen. They are the real, reasonable, everywhere-deployed baseline, and they fail in two specific ways:\nThe cap is a guess. max_iterations=5 overruns a loop that needed 3 (wasting two iterations) and cuts short a loop that needed 8 (returning a worse answer than it had at iteration 4). One number cannot be right for both.\nThe action-hash check only catches exact repetition. A loop that is oscillating between two near-but-not-identical answers, or slowly degrading, hashes to a new value every time and runs to the cap.\nThis is the gap the control layer exists to close.\nThe control layer: where LoopGain sits LoopGain answers question 2. 
It is an open-source (Apache-2.0) Python library that measures, every iteration, whether the loop is actually converging, and acts on that measurement while the loop is still running.\nYou give LoopGain an error signal each iteration, whatever \u0026ldquo;wrong\u0026rdquo; means for your loop: the number of failing tests, the count of schema violations, an LLM-judge\u0026rsquo;s distance from the target, a lint-error count. LoopGain doesn\u0026rsquo;t define correctness for you; you hand it the number, and it watches that number\u0026rsquo;s trajectory.\nThe underlying quantity is loop gain, written $A\\beta$, borrowed from the Barkhausen stability criterion in control engineering: the factor by which the error changes from one iteration to the next. $A\\beta \u0026lt; 1$ means each pass shrinks the error; $A\\beta \\geq 1$ means it holds or grows, the signature of an agent talking itself out of a correct answer.\nA single step is noisy, though, so LoopGain doesn\u0026rsquo;t react to one ratio. It looks at the trajectory of your error signal across the whole loop. Working in log space, where a geometric $A\\beta$ trend becomes a straight line, it fits the trend across iterations, runs a significance test to check the trend is real rather than chance, and measures how much the error is oscillating around it. 
From those, it classifies the loop\u0026rsquo;s current state (one of five trajectory bands, plus the TARGET_MET short-circuit) and acts on it:\nState | What it means | Action\nTARGET_MET | error hit your target | stop, keep the answer\nFAST_CONVERGE | error collapsed fast, still dropping | continue, predict ETA\nCONVERGING | error trending down, significantly | continue\nSTALLING | trend has flattened | stop and keep best-so-far once it persists\nOSCILLATING | bouncing around, not improving | stop, roll back to best-so-far\nDIVERGING | error trending up | stop, roll back to best-so-far\nTARGET_MET is the \u0026ldquo;done\u0026rdquo; stop, a short-circuit that fires when your error signal hits its target. FAST_CONVERGE is not a stop: the error has dropped sharply but isn\u0026rsquo;t at target yet, so the loop keeps going (LoopGain predicts the ETA). The loop stops early when the trajectory goes bad (OSCILLATING / DIVERGING) or flat (STALLING).\nThe integration wraps your existing loop. should_continue() is the new guard, observe() records each iteration\u0026rsquo;s error, and the best-so-far answer is there if the loop degraded:\nfrom loopgain import LoopGain\nlg = LoopGain(max_iterations=20)  # the cap is now a backstop, not the plan\nwhile lg.should_continue():\n    output = agent.step()  # your verify-revise step\n    errors = verify(output)  # the error signal: failing tests, etc.\n    lg.observe(errors, output)  # measure Aβ; updates the loop\u0026#39;s state\nanswer = lg.result.best_output  # best-so-far, even if the last pass got worse\nshould_continue() returns False the moment LoopGain detects a terminal state: target met, stalled, oscillating, diverging, or the max_iterations backstop. max_iterations does not disappear; it becomes the backstop instead of the primary control. 
The loop now stops when it has converged, not when it has run out of retries — and lg.result.best_output gives you the best answer the loop ever produced, not whatever it happened to land on at the end.\n\u0026ldquo;But I can do that with an eval\u0026rdquo; The fair objection. Langfuse and Helicone both have evals, and you could wire an LLM-as-judge eval into your loop and stop when it crosses a threshold. People do.\nThe difference is what you are measuring and what you have to build. An eval gives you an absolute quality score and leaves you to write the stopping logic — pick a threshold, handle the case where quality goes up then down, remember the best-so-far answer, decide what \u0026ldquo;no longer improving\u0026rdquo; means. LoopGain measures the dynamics of the loop directly — the trend of the error across iterations and whether it\u0026rsquo;s statistically real — which is what actually tells you whether more iterations will help, and ships the stop + rollback as the product. One is a metric you build a controller around; the other is the controller.\nThe worked arithmetic The reason this is worth a library and not a shrug: the waste is measurable.\nIn our public benchmark — 2,000 paired real-API trials across four conditions — a fixed max_iterations=20 baseline cost $27.05 in total tokens. The same workloads under LoopGain cost $1.94: a 92.8% cost reduction, or $25.11 saved across the run, because most loops had converged long before iteration 20 and LoopGain stopped them there. Median wall-clock dropped from 30.9s to 2.1s — about 15× faster — for the same reason: you stop paying for iterations that are no longer improving the answer.\nQuality held. On natural-distribution workloads a blind judge preferred LoopGain\u0026rsquo;s output to the run-to-cap baseline 50–63% of the time (i.e. 
preserved it), and on workloads engineered to degrade — where running to the cap actively makes the answer worse — it preferred LoopGain 92–95% of the time, because the rollback caught the degradation. The full protocol, the pre-registered floor we missed, and the raw data are public.\nWhere LoopGain doesn\u0026rsquo;t help It only does anything for iterative loops. A single-pass agent — one model call, no revise step — has no trajectory to read, so there is nothing to control. You also have to be able to produce an error signal each iteration; if you cannot say how wrong an output is, LoopGain cannot tell whether the loop is improving. And it needs a few iterations to estimate the trend — on a loop that finishes in one or two steps, the cap is doing the work, not LoopGain. The benchmark above is real, but it is our workloads; how much you save depends on how noisy your error signal is and how often your loops actually run long. If your loops are already short and cheap, you do not need this.\nSo which tool do you actually want? The honest answer, and the whole point of drawing the line:\nIf your question is \u0026ldquo;what did my agents do?\u0026rdquo; — you want an observability tool. Langfuse or Helicone if you want open-source and self-hostable; LangSmith if you are already in the LangChain ecosystem and want evals next to your traces (and are fine with a proprietary, cloud-first tool). If your question is \u0026ldquo;should this loop still be running?\u0026rdquo; — that is control, and observability tools do not answer it. That is where LoopGain lives. Most teams running real agent loops eventually want both: LoopGain to decide, an observability tool to record what it decided. They sit at different layers and compose cleanly. LoopGain is pip install loopgain, Apache-2.0, with adapters for LangGraph, CrewAI, AutoGen, LangChain, OpenAI Agents, and the Claude Agent SDK — plus a raw API for anything else. 
The benchmark, the protocol, and the data are open.\nFAQ Is LangSmith open source? No. LangSmith is a proprietary, cloud-first product from the LangChain team; self-hosting is available only on the Enterprise plan. The LangChain and LangGraph frameworks are open-source, but LangSmith itself is not. Langfuse (MIT) and Helicone (Apache-2.0) are the open-source, self-hostable observability options.\nWhat\u0026rsquo;s the difference between AI agent observability and loop control? Observability tools (LangSmith, Langfuse, Helicone) record and evaluate what an agent loop did, after the fact — traces, token cost, quality scores. Loop control (LoopGain) decides, while the loop is still running, whether it should keep going, stop, or roll back to its best answer. Observability is a dashboard you read after the run; control is a function that runs inside the run. Most teams running real loops want both.\nCan\u0026rsquo;t I just lower max_iterations? No single cap is right for every input. max_iterations=5 stops a loop that finished at iteration 3 (wasting two passes) and returns a degraded answer for one that peaked at iteration 4 of 8. LoopGain measures each loop\u0026rsquo;s actual convergence (loop gain, Aβ) and stops when that loop has converged or started to degrade — instead of guessing one number for all of them.\nWhat is loop gain (Aβ) for an AI agent loop? Loop gain, written Aβ, is the factor by which the error changes from one iteration to the next. Aβ \u0026lt; 1 means the loop is converging (error shrinking); Aβ ≥ 1 means it\u0026rsquo;s stalling or diverging. From an error signal you supply (failing tests, schema violations, a judge score), LoopGain estimates the Aβ trend across the loop — fitting it in log space and testing that the trend is statistically real — and stops the loop when that trajectory shows it has converged or begun to degrade. 
The term comes from the Barkhausen stability criterion in control engineering.\nDo I need both LoopGain and an observability tool? They compose rather than compete. LoopGain controls the loop at runtime (stop + best-so-far rollback) and runs in-process as an Apache-2.0 library; an observability tool records what happened for later analysis. Use LoopGain to keep loops from running too long or degrading, and pair it with Langfuse, Helicone, or LangSmith if you also want tracing and evals.\nLoopGain is an open-source library that stops AI agent loops when they\u0026rsquo;ve converged and rolls back when they\u0026rsquo;re degrading — so you stop paying to iterate past a good answer. Tool details above were verified against each project\u0026rsquo;s own documentation in May 2026; capabilities change, so check the primary sources if you are making a decision on them.\n","permalink":"https://loopgain.ai/blog/posts/open-source-tools-for-monitoring-ai-agent-loops/","summary":"Most agent tooling tells you what happened after the loop ran. A smaller category acts while the loop is still running. Here\u0026rsquo;s the landscape, drawn on the line that matters: observe versus control.","title":"Tools for monitoring and controlling AI agent loops"}]