I spent half of yesterday watching a pair-coding setup fail at the same task three times in a row. The setup was the one I had been quietly proud of: a developer agent and an adversarial-developer agent running concurrently, with the adversary observing the developer's live stream and able to interrupt mid-task whenever it spotted a problem. The premise felt obvious. Why wait for a bad diff when you can catch the mistake while it is being made?

In theory, an attractive idea. In practice, after many rounds of tweaking the prompts, debouncing rules, and interrupt semantics, the result has been the same every time: a confused developer agent that produces no valuable output. Yesterday was the cleanest example I have. A 16-minute session that burned 13.68M tokens and ended with a working tree full of .bak files, plus two follow-up sessions that produced zero code edits across 38 combined developer turns. The pattern was clean enough that I went back to read what the rest of the field has been doing with coder + critic agent pairs. The answer was unkind: the architecture I had built is exactly the one the production literature has been moving away from for the past year. This is the story of what went wrong, why, and what I should have been doing instead, which is the same thing I have already been doing in production for about a year.

The setup

The system is pi-ensemble, an extension I maintain that wraps the Pi terminal coding agent and turns the parent process into a project manager dispatching role-specialised child processes. One of those tools is pair_watch. It spawns a developer child and an adversarial-developer child simultaneously, summarises each developer turn into ≤500 characters of tool-call descriptions and message excerpts, and pipes that summary into the adversary as a steering input. The adversary can then call interrupt_developer whenever it sees something concerning. That call is injected back into the developer's next turn as a user-message prefixed [pair:adversarial].

The developer's system prompt tells it to "read the interrupt before your next action" and "adjust your plan." The adversary's prompt says, almost verbatim, "Restraint is false economy. If you can predict a failure mode the dev hasn't addressed, interrupt." You can probably guess where this is going.

The task that broke it

The work itself was a six-pass code review verdict on a real PR with three remaining MEDIUM findings. Two were architectural (a chat SSE contract that had started carrying research-specific event types, and a deep_research function that had grown a direct coupling to the streaming transport) and one was error-handling (an unknown-status branch that returned an error without emitting a terminal Failed event to the UI). The first two findings touched signatures in a Rust file using async tokio channels, trait objects, and lifetimes. If you have done that kind of refactor you already know what is coming: it is not difficult conceptually, but it requires holding several constraint sets in your head at once. Lifetimes that satisfy one call site need to also satisfy the other. The compile-error feedback loop is slow and the errors are walls of text.

I dispatched pair_watch. Sixteen minutes later it came back with a verdict of ESCALATED and a working tree in worse shape than before.

What the transcripts showed

I pulled the developer and adversarial JSON transcripts and walked through them turn by turn. Two distinct failure modes appeared, and they are worth separating because they have different causes.

Session 1 was the architecturally pure failure. The developer made real progress on the two simpler findings within the first four minutes. Then it hit the trait-object refactor. The Rust file oscillated in size, 671 → 854 → 622 → 671 lines, as the developer started a structural edit, was interrupted by the adversary mid-edit ("the old function body wasn't fully removed; the file has duplicate function definitions"), abandoned the partial work to address the interrupt, started a different approach, was interrupted again ("the file is now 866 lines, over the 500-line limit"), and so on. By the end the file contained the old function body, a half-finished make_progress_callback replacement, and an uncompilable type alias. The developer re-read the same file 35 times across 112 turns, about 31% of all turns. Input-to-output token ratio was 4.4:1, which is what you would expect when context is being rebuilt after every interruption rather than flowing forward.

What is important about session 1 is that every adversarial interrupt was correct. The duplicate code really was there. The file size really was over budget. The signature mismatch really did exist. The interrupts were good signal arriving at the wrong moment. Each one forced the developer to leave a half-finished structural edit and respond, which is exactly how you end up with three half-finished structural edits laid on top of each other.

Sessions 2 and 3 were a different failure, and I want to be careful to distinguish them. After the first session left a messy partial state, I asked the system to recover. This time the developer made zero edits across 38 turns: all colgrep, ctx7 documentation lookups, file-line counts, and re-reads. The adversary interrupted nine times across the two sessions, each interrupt becoming progressively more pointed ("Please stop searching and implement"). The developer did not implement.

This second mode is tempting to blame on the interrupts but I do not think that is quite right. The developer was already in an avoidance loop before the interrupts started. The working tree was incoherent, two files were mid-refactor in incompatible ways, and the model in use (a smaller one I had switched to mid-recovery) did not want to commit to an edit it might have to undo. The interrupts were trying to push the developer out of the loop, not into it. The second-order point is the one that matters: session 1's interrupt-driven thrash created the broken state that sessions 2 and 3 could not recover from. The live-steering mechanism produced an artifact, a half-edited working tree, that subsequent runs inherited as input.

Why the architecture is biased this way

I went back and read my own code, which is always a humbling exercise. The mechanism is doing exactly what it was designed to do. The design is the problem.

Three things compound. First, each interrupt is injected as a user-turn in the developer's context. There is no debouncing, no minimum gap. If the adversary sees a problem after every developer turn, the developer gets an interrupt after every turn. Second, the developer's prompt explicitly instructs it to re-plan on interrupt. So the developer treats every interrupt as a signal to pivot, not as a note to file. Third, the adversary's prompt is biased toward firing: "restraint is false economy." Combine these and you have a system that, by construction, prevents the developer from sustaining a multi-turn structural edit.

For tasks that decompose into independent steps (a small bug fix, an incremental refactor, a feature with a clear scaffolding) this is fine and probably helpful. The pivot cost is small and the catch is valuable. For tasks where the steps do not decompose, anything where you have to hold N constraints simultaneously and resolve them with a single coherent edit, every pivot is a partial-write that has to be unwound or merged. The pivot cost dominates the catch value.

What the field already knew

After enough self-flagellation I went looking for who else had tried this. The literature is more developed than I expected and the convergence is striking.

The dominant pattern in production multi-agent work is generate-then-critique with a debate loop. The MASQRAD paper from early 2025 is representative (the domain is data visualization queries rather than code, but the mechanism transfers): an actor LLM produces the full artifact, then a critic LLM enters a multi-agent debate to refine it. The critic does not interrupt generation. This is, almost exactly, the legacy developer → adversarial_loop flow that pair_watch was meant to replace.

A more recent paper, MASDP in IEEE TSE (Jan 2026), is the one that hits closest to home. The authors explicitly identify the two failure modes that emerge when critic-style agents try to steer coder agents: "the difficulty in accurately interpreting complex role prompts" and "the fragility of inter-agent coordination." They propose a different architecture entirely. Rather than a reactive critic, they fine-tune an anticipatory "Reminder" agent that frontloads likely-failure warnings into the coder's initial prompt and iteratively refines those precautions based on execution feedback. The result outperforms GPT-4 baselines while reducing compute, on small LLMs. The phrasing in the paper is striking: they describe their contribution as shifting optimisation burden off the programmer agent. That is the inverse of what live steering does, which is to add burden. Every interrupt is a new context to integrate.

A2C-LLM (MDPI Drones, May 2026) goes further in a different domain (UAV swarm task allocation): replace the LLM critic with a lightweight value-function head. The critic produces a scalar advantage signal, not English-language commentary. This is cheaper, more stable, and does not pollute the actor's context window. The principle generalises beyond drones, but it is only feasible when you have something to optimise toward, which for general code work you usually do not.

On the framework level, the divergence is now explicit. AutoGen, the conversational turn-by-turn steering framework, is widely characterised as a research and brainstorming tool with a steeper learning curve and higher debugging difficulty that makes it ill-suited for production. LangGraph has become the production default specifically because it is deterministic, checkpoints state, and exposes observability. The community has been making this call out loud for at least a year, and I missed it because the pair-coding metaphor was too seductive.

The unifying lesson across all of these is the same: anticipatory critique beats reactive interruption. Tell the coder what to watch for before it starts. Review the result after it finishes. Do not talk to it while it is thinking.

What I am taking from this

Three honest things.

The first is the empirical one. For a year I have been running the turn-based pattern: the developer agent works in one context, the adversarial reviewer runs after the diff is produced in a separate context, and the verdict feeds either a fix loop or a merge. That setup works. It has been the backbone of every serious piece of agentic work I have shipped. Pair-watch was an attempt to improve on it by moving the critique earlier, and across many rounds of tuning it has not produced a single result that the turn-based pattern would not have produced more cheaply and more reliably. The architecture is the problem, not the prompts.

The second is the meta-point, which is the one I keep returning to. The reason pair_watch keeps being seductive is that "two agents working together in real time" sounds intuitively better than "one agent works and then another reviews." It pattern-matches to how humans pair-program well. But LLMs are not humans. They do not have the working memory to hold a multi-constraint problem across an interruption and pick it up where they left off. They re-plan from scratch every turn, and re-planning from scratch with one more user-turn worth of context is worse, not better, than re-planning from scratch with the original problem statement. The thing that makes human pair programming work, shared continuous state, is the thing LLMs structurally do not have. No amount of prompt tweaking changes that.

The third is what I am actually going to do. Pair-watch is gone from the default workflow. The default returns to where it has been for a year: generate, then adversarial-review, then six-pass lens review. The mechanism stays in the repo because there are tasks where the pivot cost is low and an early "stop, you're about to do X" is cheap insurance: linear bug fixes, walking through a series of small independent edits, anything where the plan does not need to bend. For those, opt-in. For everything else, the live channel between the two agents is a tax on the developer's attention.

The next experiment is the MASDP pattern from the IEEE TSE paper: front-load the adversary's predictable warnings into the developer's initial prompt as anticipatory constraints and let the developer work through them without interruption. Whether that improves on plain post-hoc review I genuinely do not know yet. At least now I know what hypothesis I am testing, and which one has already failed.


If you maintain a multi-agent system that does live steering and you have seen it converge on hard problems, I would like to hear about it. Particularly the wall-clock-per-task numbers and what kinds of tasks it handles well. The literature I found was largely against the pattern. The counterexamples might be more interesting than the consensus.