pi-ensemble: agentic coding that optimizes for quality, not speed

I released pi-ensemble yesterday. It is alpha. The pattern it codifies is not.

For more than a year I have been running a multi-agent setup against my fork of opencode: a project manager orchestrating specialist children, a mandatory adversarial gate before commits, a six-lens code review before merges. The combination is slower than letting a single agent rip through tickets, and it is not subtle about it. That is the whole point. pi-ensemble is the same workflow rebuilt as a clean extension to Pi, Mario Zechner's terminal coding agent. Same philosophy, fewer hacks.

The thing most agentic coding setups get wrong

The default optimization target for an AI coding agent right now is velocity. Lines of code per hour. Tickets closed per day. Time from prompt to PR. The frameworks lean into it: parallel workers, autonomous loops, "ship it" rhetoric. The metric is throughput.

This is the wrong metric if you care about whether the code is correct.

Codebases age. The cost of a bug found in review is a fraction of the cost of the same bug found in production. The cost of a security issue found by an adversarial pass is a fraction of the cost of the same issue found by Snyk in your dependency tree. The cost of a poorly-typed signature surfaced before merge is a fraction of the cost of refactoring around it three months later. Speed at the input does not save time at the output. It just shifts where you pay.

pi-ensemble is built on the opposite assumption: that the right number of agents in a workflow is "however many it takes to find what is wrong before you commit it."

The architecture

The parent Pi process you launch is the project manager. When you run a slash command, the extension injects PM doctrine into the system prompt for that turn. The PM then dispatches specialist children. Each one is a separate Pi process spawned with `pi --mode json -p --no-extensions --no-session --append-system-prompt <role.md>`, with its own assembled prompt and its own context window.
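A minimal sketch of how such a child invocation could be assembled, written in Python purely for illustration. The flags are the ones quoted above; the helper name `build_child_argv`, the role file name, and passing the task prompt as the final positional argument are assumptions of this sketch, not pi-ensemble's actual code.

```python
from pathlib import Path


def build_child_argv(role_file: Path, prompt: str) -> list[str]:
    """Build the argv for one specialist child process.

    The flags mirror the command line quoted in the text. Appending the
    task prompt as the last positional argument is an assumption.
    """
    return [
        "pi",
        "--mode", "json",        # machine-readable output for the parent
        "-p",
        "--no-extensions",       # child does not load extensions
        "--no-session",          # child does not persist a session
        "--append-system-prompt", str(role_file),
        prompt,
    ]


# Hypothetical role file and task, for illustration only.
argv = build_child_argv(Path("developer.md"), "Implement the parser fix")
print(" ".join(argv[:-1]))
```

A parent could hand a list like this to `subprocess.run(argv, capture_output=True, text=True)` and read the child's output from stdout. Each call starts a fresh process, which is what gives every specialist its own context window.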

Six roles ship:

Role	Purpose
`project-manager`	Orchestrates. Holds the workflow state.
`developer`	Implements. Writes the code.
`ops`	Commits, branches, PRs. Touches git.
`explore`	Research. Web, codebase, prior memory.
`adversarial-developer`	Tries to break what the developer just wrote.
`code-review-specialist`	One of six lenses applied to a finished PR.

Children do not share context with the PM. They report back through structured tool calls. This keeps each specialist's context small and focused, and prevents the slow context contamination that single-agent workflows accumulate over a long session.

Five commands

/start          Initialise session: memory, codebase index, git/PR/CI state
/research       Fan out explore specialists in parallel
/plan           Draft a GitHub issue, classify, apply template
/work           Run an issue end-to-end
/review         On-demand six-pass review of a PR or path

The interesting one is /work. Hand it an issue number and it runs the full pipeline: feature branch, optional parallel worktrees, developer children dispatched to implement, the mandatory adversarial gate, ops commits, PR, six-pass code review, CI watch, and merge per the policy in AGENTS.md. The two gates are not optional. They are the reason the setup exists.

Two gates

The adversarial gate runs before every commit. An adversarial-developer child receives the diff and the implementation context and is given one job: find what is wrong. Edge cases, missing error handling, off-by-one errors, security implications, behavioural assumptions that do not hold. If it finds something, the developer gets up to three rounds of fixes. Only then does the commit happen.
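The control flow of the gate can be sketched as a bounded loop. This is an illustration under stated assumptions: `find_problems` and `apply_fixes` are hypothetical stand-ins for the adversarial-developer and developer children, and blocking the commit once the three rounds are used up is this sketch's choice, since the text does not say what happens in that case.

```python
from typing import Callable

MAX_FIX_ROUNDS = 3  # "up to three rounds of fixes"


def adversarial_gate(
    diff: str,
    find_problems: Callable[[str], list[str]],
    apply_fixes: Callable[[str, list[str]], str],
) -> tuple[bool, str]:
    """Return (ok_to_commit, final_diff)."""
    for _ in range(MAX_FIX_ROUNDS):
        problems = find_problems(diff)
        if not problems:
            return True, diff           # gate passed, commit may proceed
        diff = apply_fixes(diff, problems)
    # Re-check the result of the last fix round. If problems remain,
    # the commit is blocked (an assumption of this sketch).
    return (not find_problems(diff)), diff


# Toy stand-ins: every "bug" token is a problem, each round fixes one.
def toy_find(diff: str) -> list[str]:
    return ["bug"] if "bug" in diff else []


def toy_fix(diff: str, problems: list[str]) -> str:
    return diff.replace("bug", "fix", 1)


passed = adversarial_gate("bug bug", toy_find, toy_fix)
blocked = adversarial_gate("bug bug bug bug", toy_find, toy_fix)
```

The reviewer and the fixer are separate callables on purpose: the point of the gate is that the agent judging the diff is not the one that wrote it.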

This catches a class of bugs that single-agent setups miss systematically. A single agent that wrote the code is the wrong agent to evaluate it: it carries the same assumptions, the same blind spots, the same confidence about what should work. An adversarial child with a different system prompt and no context contamination finds things the writer cannot see.

The six-pass code review runs before merge. Six children, each pinned to one lens:

Lens	Looks for
Security	Auth holes, injection, secret handling, permission boundaries
Error handling	Unhandled paths, silent failures, recovery behaviour
Type safety	Coercions, nullability, invariant violations
Performance	Hot paths, allocations, N+1 patterns, sync-in-async
Architecture	Coupling, dependency direction, separation of concerns
Simplicity	Code that exists but does not need to

Findings come back as schema-validated report_finding tool calls. They get deduplicated by (path, line, title), precedence-merged so the highest severity for a given location wins, and turned into a verdict: APPROVED, ISSUES_FOUND, or CRITICAL_ISSUES_FOUND. The merge does not happen on a critical verdict without explicit override.
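The merge step is easy to picture in code. The following is a sketch, not the extension's implementation: the `Finding` fields beyond path, line, and title, the severity names and their ordering, and the rule that any critical finding yields CRITICAL_ISSUES_FOUND are all assumptions. It also folds deduplication and precedence merging into a single pass keyed on (path, line, title).

```python
from dataclasses import dataclass

# Severity names and their ordering are assumptions of this sketch.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    title: str
    severity: str
    lens: str


def merge_findings(findings: list[Finding]) -> list[Finding]:
    """Keep one finding per (path, line, title), preferring higher severity."""
    merged: dict[tuple[str, int, str], Finding] = {}
    for f in findings:
        key = (f.path, f.line, f.title)
        kept = merged.get(key)
        if kept is None or SEVERITY_RANK[f.severity] > SEVERITY_RANK[kept.severity]:
            merged[key] = f
    return list(merged.values())


def verdict(findings: list[Finding]) -> str:
    """Map merged findings onto the three verdicts named in the text."""
    if not findings:
        return "APPROVED"
    if any(f.severity == "critical" for f in findings):
        return "CRITICAL_ISSUES_FOUND"
    return "ISSUES_FOUND"


# Hypothetical findings: two lenses flag the same location.
raw = [
    Finding("src/auth.py", 10, "Token logged in plaintext", "medium", "error-handling"),
    Finding("src/auth.py", 10, "Token logged in plaintext", "critical", "security"),
    Finding("src/db.py", 88, "Query inside loop", "high", "performance"),
]
merged = merge_findings(raw)
result = verdict(merged)
```

Two lenses reporting the same problem collapse into one entry that carries the more severe rating, so the verdict reflects the worst credible reading of each location rather than the number of reviewers who noticed it.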

The six lenses are not arbitrary. They are the categories that I have seen single-agent reviews most consistently miss, distilled from a year of opencode runs and several thousand findings logged in the corresponding skill files.

What is underneath

pi-ensemble does not stand alone. It assumes a stack:

vipune for cross-session memory. Every agent calls it.
oo for context-efficient wrapping of chatty CLIs like git and gh. Without this the specialist windows fill with noise.
colgrep for semantic code search. Used to find existing implementations before writing new ones.
parallel-cli for web search and deep research. The explore role expects it.

pi-ensemble is the visible top of a longer-running effort. Each of these tools exists because something in the workflow needed it and the existing options were not good enough.

Per-role models

You almost certainly want a smarter model for the PM and a faster one for the specialists. pi-ensemble resolves subagent models through five layers, from a per-call override down to a global default. Run /ensemble-model to pick interactively from whichever providers you have authenticated through Pi's /login (Anthropic, OpenAI, GitHub Copilot, Cerebras, whatever).
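Layered resolution of this kind usually comes down to "first configured value wins". A minimal sketch, assuming that behaviour: only the per-call override and the global default are named in the text, so the three middle layers and all model names below are hypothetical placeholders.

```python
from typing import Optional


def resolve_model(*layers: Optional[str]) -> str:
    """Return the first configured model, scanning highest priority first."""
    for candidate in layers:
        if candidate:
            return candidate
    raise LookupError("no model configured at any layer")


# Layers ordered from most to least specific. Only the first and the
# last are named in the text; the middle three are hypothetical.
model = resolve_model(
    None,             # per-call override (not set for this dispatch)
    "fast-model",     # hypothetical: per-role setting
    None,             # hypothetical: project-level config
    None,             # hypothetical: user-level config
    "default-model",  # global default
)
```

With no override on the call, the role-level setting wins and the global default is never consulted.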

A typical configuration: Opus or Sonnet on the PM, fast Cerebras models on the lens reviewers, Sonnet on the developer and adversarial-developer. The cost arithmetic is roughly six children × two-thousand-token outputs plus context, per review cycle, which lands around $0.02 to $0.10 on Cerebras and considerably more on Anthropic. That is the tax for the quality gate.
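The arithmetic can be made explicit. In this sketch the context size and the per-million-token prices are placeholder values chosen only to show the shape of the calculation; they are not quoted from any provider's price list.

```python
def review_cycle_cost(
    children: int,
    context_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Cost in USD of one review cycle: each child reads context, writes output."""
    per_child = (
        context_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000
    return children * per_child


# Six lens reviewers, two-thousand-token outputs (from the text).
# The 8,000-token context and both prices are placeholders.
cost = review_cycle_cost(
    children=6,
    context_tokens=8_000,
    output_tokens=2_000,
    usd_per_m_input=0.60,
    usd_per_m_output=1.20,
)
print(f"${cost:.4f} per review cycle")
```

With these placeholder numbers the cycle costs a few cents. Context size dominates the bill, which is one more reason to keep each specialist's window small.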

Why Pi, not opencode

I have been running this same orchestrator pattern against my opencode fork for over a year. The maintenance cost of that fork is what eventually pushed me to migrate.

Opencode is a great piece of software. It is also a much larger surface area than I needed: a TUI, a web UI, a desktop wrapper, a server, a plugin layer, multiple packages. Every upstream change required reconciling against my own modifications; every dependency bump touched something I had to retest. The deeper I got into customising the orchestration layer, the more time I spent on integration rather than on the workflow itself.

Pi sits in a different place. The harness is lightweight: a single binary, a small system prompt that ships from the harness rather than being layered into your config, an extension model that is straightforward to develop against. There are no async tasks in either harness yet (this is the one thing I miss most), but Pi feels faster and gets out of the way more readily.

The ecosystem dynamics are also different. Anomaly has a real business model around opencode involving their Zen LLM API and hosted services, and the contribution rules reflect that. The current CONTRIBUTING.md is more open than it used to be (the old policy was effectively "no feature PRs from outside the core team"), but UI and core product changes still require a design review with the core team before implementation. For a fork that needs to evolve at the pace of my own experimentation, that is the wrong governance model. Not wrong in absolute terms; wrong for what I am trying to do.

Pi's extension story lets me ship the orchestration layer cleanly without forking the harness at all. pi-ensemble is an extension. So is pi-worktree. So is pi-permissions. Each is a separate concern, each can move on its own schedule, none of them require me to maintain a fork of the underlying tool. That is the part I could not get with opencode without spending most of my evenings on rebase.

If opencode is the right tool for you, keep using it. The work I did there is what made pi-ensemble possible. But the maintenance arithmetic stopped working for me, and the migration has paid for itself already.

The honest caveats

This is alpha. Things will change before 1.0. Specifically:

Permissions are not enforced per role. Specialists inherit Pi's default permissions. The role system prompt is the only thing keeping each in its lane. This is acceptable on a sandbox repo and not acceptable on anything you care about. pi-permissions will fix this.

The six-pass review costs real money. Not enough to matter on personal projects. Enough to matter if you fire it indiscriminately across a hundred PRs a day. Pin the cheap models for the lens reviewers; reserve the expensive ones for the PM.

Worktrees go through git CLI calls. They work, but the safer path is the pi-worktree plugin when its programmatic API stabilises.

It is tested on macOS. Linux should work. Windows almost certainly does not.

This is not for everyone. If your team measures developer output in story points and your AI integration is supposed to make those story points cheaper, pi-ensemble is the wrong tool. It will spend more tokens, take longer, and produce fewer PRs per day. What it produces will be better-reviewed, more defensible, and less likely to bite you later. That is a trade you have to actively want.

Why this exists

A year of running this pattern in opencode taught me that the bottleneck in agentic coding is not the model. It is the discipline applied to what the model produces. Models will happily ship reasonable-looking code that is wrong in non-obvious ways. The question is whether your workflow gives that code a serious chance of being caught.

pi-ensemble is one answer to that question. The interfaces will change. The philosophy will not.

Repo: github.com/randomm/pi-ensemble