H100 vs Strix Halo: the gap is bigger than the first benchmark suggested

I have been benchmarking Qwen3.6 on an UpCloud H100 80GB SXM, comparing it against the numbers I have been running on my Strix Halo box. The first sweep, four days ago, used Qwen3.6-27B-FP8 (dense) on the H100 under vLLM 0.21, and my established Qwen3.6-35B MoE Q8 numbers on Strix Halo under llama.cpp. The directional answer it gave me was "single-user generation is closer than you would think; prefill and concurrency win." That answer turns out to have been half-right and half-misleading. I ran the like-for-like comparison today and want to correct the picture.

The context for this work has not changed. At Trail Openers we are moving toward more environmentally sustainable LLM infrastructure for our internal tooling, and the H100 setup we are building will be shared by several developers. UpCloud was chosen for three reasons that compound. First, their data centres run majority on renewable energy and their scope 1+2 emissions are compensated. Second, the fi-hel2 data centre is in Helsinki, the same metropolitan area as Trail Openers, which means the whole team gets single-digit-millisecond latency to the endpoint. Third, EU jurisdiction and GDPR-native data handling matter for the work we do; the prompts, the diffs, the codebases, none of that leaves the EU.

What the first sweep got wrong

The first benchmark ran a dense 27B on the H100 against an MoE on Strix Halo. That comparison was fair on weight footprint (28-34 GB class) but unfair on architectural fit. Dense and MoE behave very differently per-token, and the model choice on each side biased the result in opposite directions.

The summary numbers from that first sweep:

MetricH100 (Qwen3.6-27B dense FP8)Strix Halo (Qwen3.6-35B MoE Q8)Δ
Prefill rate (~512 input tokens)11,378 t/s1,388 t/s~8.2×
Single-request generation rate77 t/s54 t/s~1.4×

The 1.4× looked like good news for Strix Halo. It is not. It is the artefact of two model choices, not a real architectural finding. Strix Halo was running the MoE variant that suits its ~225 GB/s bandwidth (only the active experts have to be streamed per token). The H100 was running a dense model where every parameter is touched on every forward pass. That is not a fair fight in either direction.

The fair comparison is running the same MoE variant on both, with each platform's best serving stack and best decoding tricks. So I did that.

The like-for-like sweep

Yesterday I redeployed the H100 with Qwen3.6-35B-A3B-FP8 (the same 3B-active / 35B-total MoE Strix Halo runs) plus MTP speculative decoding turned on (--speculative-config '{"method":"mtp","num_speculative_tokens":2}'). MTP runs the model's built-in multi-token-prediction draft heads in parallel with the main forward pass; each accepted speculation multiplies effective throughput. KV-cache compression (--kv-cache-dtype fp8) and the Marlin MoE backend (--moe-backend marlin) round out the configuration.

The new single-request comparison:

MetricH100 (35B MoE + MTP)Strix Halo (35B MoE)Δ
Median TTFT (4k input)123 ms434 ms~3.5×
Median TPOT (decode step)3.6 ms~18.5 ms~5.1×
Single-request generation rate236 t/s54 t/s~4.4×

The "Strix Halo holds its own on generation" framing from the first benchmark was wrong. With both platforms running the model that actually suits them, with each platform's best decoding pipeline, the H100 is roughly 4.4× faster on raw single-user generation. The earlier 1.4× number was the H100 deliberately handicapped by dense-model arithmetic. Once it gets to use MoE plus MTP, the gap is the gap.

Two specific things drive the uplift. MoE means fewer active parameters per token, so each forward pass is cheaper. MTP means each forward pass can yield two-plus tokens instead of one. Multiply those and you get the 3.5× TPOT improvement at low load (3.6 ms versus 12.8 ms on the dense run), which compounds into the 3.15× single-request output rate versus the same H100's dense numbers.

Concurrency moves further in the same direction

The first sweep showed the H100 scaling to roughly 12× the effective serving capacity of Strix Halo at the 16-agent operating point. With the MoE+MTP configuration, that gap roughly doubles.

The mid-context concurrency shape (4k input, 512 output, agentic-coding turn size):

ConcurrencyAggregate output t/sPer-agent t/sMedian TTFTP99 TTFTMedian TPOT
1236236.2123 ms141 ms3.6 ms
4653163.4149 ms437 ms5.3 ms
81,163145.3231 ms820 ms6.2 ms
161,654103.3283 ms1.62 s8.3 ms
322,22969.7647 ms3.17 s11.1 ms

The sweet spot moves from "8 to 16 agents" to "16 to 32 agents." At concurrency 32 you still get 70 t/s per agent (more than Strix Halo's single-user rate) and 2,229 t/s aggregate. At concurrency 16 the aggregate is 1,654 t/s with P99 TTFT under 1.7 s, comfortably interactive.

Strix Halo doing the same 16-agent workload still takes roughly 60 seconds (llama.cpp does not batch). The H100 with MoE+MTP does it in roughly 2.5 seconds. The effective serving advantage is now somewhere around 24×, not 12×.

Cost recalculated

Same UpCloud business-hours pricing (€1.79/hr) divided by the new throughput numbers:

Operating pointOutput t/s€/M output tokens
Single agent236€2.11
8 concurrent1,163€0.43
16 concurrent1,654€0.30
32 concurrent2,229€0.22

At the operating point a small dev team would actually use (16 concurrent agents during work hours), the marginal output cost is roughly €0.30 per million tokens. Anthropic's GPT-5.2-class pricing is now $14 per million output tokens, so the gap is roughly 45×. Even accounting for chain-of-thought overhead on a reasoning model (more on that below) and the fact that real traffic gets lower MTP acceptance than --ignore-eos benchmarks, the economics are not close.

What this means in practice

Two specific deployment shapes still look obviously correct, but the second one looks more obviously correct than I wrote four days ago.

One Strix Halo per developer for personal use, with the right model shape. For one human at one keyboard running an MoE model that fits the memory architecture, Strix Halo remains a reasonable personal-AI machine. 128 GB unified memory, real bandwidth, no API bill, no data leaving the box. The caveat is still model selection: dense 27B-class models are not what this machine is good at. Pick the MoE variants that suit the bandwidth profile. The Strix Halo setup guide and the gotchas post cover what it takes to actually get there. What the new H100 numbers do change is your expectations of single-user speed: at 54 t/s on Strix Halo versus 236 t/s on a properly-configured H100 endpoint, the H100 is meaningfully snappier to use for the same task. Strix Halo's win is locality and cost, not throughput.

One H100 (rented) per team for multi-agent backends. The moment your workflow runs more than one agent at a time, the architecture gap is the gap that matters, and it is now even bigger. vLLM's continuous batching plus MoE plus MTP turns one GPU into a serving fleet that absorbs ~2,200 output tokens per second at the operational ceiling. pi-ensemble dispatches up to six specialist children per /work invocation; on Strix Halo those run sequentially, on an H100 they run concurrently, and at 16-agent concurrency four-to-six developers can each run their own pi-ensemble simultaneously without anyone noticing.

This is the shape we are using at Trail Openers. One shared H100 in UpCloud's Helsinki data centre, internal endpoints behind Caddy, business-hours uptime. The decision to host locally rather than reach for a hyperscaler GPU instance was driven by three things at once: the sustainability footprint (UpCloud's energy mix is majority renewable and their ESG reporting covers scope 1, 2, and 3 with compensation for scope 1+2), the data-jurisdiction story (everything stays in the EU, GDPR-native, no extra-territorial transfers), and the simple fact that the machine sits in the same metropolitan area as the team using it. Latency from a developer's desk in greater Helsinki to a model running in fi-hel2 is dominated by the local fibre hop, not by any cross-continent route. None of this makes inference free of footprint (it never is), but it shifts the marginal cost of an extra agent run onto a cleaner grid, in a friendlier jurisdiction, with materially better latency than the default AWS/GCP region you would otherwise reach.

The caveats are real

Five things to know before you act on this.

The 35B-A3B variant trades a bit of reasoning quality for throughput. Qwen's own SWE-bench numbers put the dense 27B at ~77% and the 35B-A3B at ~73%. For complex multi-step code debugging the dense model is still the better tool. For chat, summarisation, retrieval-augmented Q&A, and the bulk of agentic-coding work, the MoE+MTP combination wins on every operational axis. Pick the model by workload; do not assume one is universally better.

MTP acceptance is workload-dependent. The 3.15× single-request uplift in the table above is on --ignore-eos random-token traffic, which is unusually easy speculation. Real chat workloads, especially code generation with strict syntax, see lower acceptance rates and therefore lower uplift. Plan for 2.0× to 2.5× sustained uplift in production rather than the 3.15× peak.

Reasoning models burn most of the output budget on chain-of-thought. Qwen3.6 emits roughly 80% of every response as reasoning trace before the final answer. The 512-token output budgets in the tables above are mostly thinking, not result. If you want 200 tokens of actual answer, plan for 1,000-1,500 output tokens per agent. A non-reasoning model, or chat_template_kwargs.enable_thinking: false, would shift this picture significantly.

P99 TTFT under load still has a tail. At concurrency 16 the median TTFT is 283 ms but the P99 is 1.62 s. At concurrency 32 the P99 is 3.17 s. This is substantially better than the dense run (which hit P99 of 7.45 s at concurrency 32), but the variance still grows with load. If interactive consistency matters more than aggregate throughput, cap at 16.

First-deploy cold start is expensive. torch.compile takes about 20 minutes on first run. Subsequent boots reuse the cache and reach /health 200 in roughly 2 minutes. Worth knowing if you spin instances up and down.

The takeaway

I came into the first benchmark expecting the H100 to be dramatically faster at everything. The first sweep, with the dense 27B, suggested it was only 1.4× faster on single-user generation, and I wrote that up. That conclusion was wrong, and the way it was wrong is instructive. Comparing different-shape models across different hardware does not isolate the hardware. It tells you what your model choice is doing.

When both platforms run the same MoE variant with each platform's best decoding stack, the H100 is roughly 4.4× faster on single-user generation, roughly 8× faster on prefill, and roughly 24× more effective at concurrent serving. The architectural gap I described four days ago is real and bigger than I said. The 1.4× number should not have been the headline.

For our use at Trail Openers, the conclusion is sharper than before. Strix Halo on the desk for personal work, running MoE models that fit the bandwidth profile. A shared H100 in UpCloud's Helsinki data centre during business hours, serving the team's multi-agent backends with the sustainability footprint, EU data residency, and same-city latency we wanted. Same shape of answer as before, with a clearer view of how big the gap actually is when you compare like for like.


Benchmark details: vLLM 0.21.0. Dense 27B run (2026-06-01): max_model_len=8192, max_num_seqs=256, gpu_memory_utilization=0.9. MoE+MTP run (2026-06-05): max_num_seqs=128, gpu_memory_utilization=0.85, kv_cache_dtype=fp8, moe_backend=marlin, reasoning_parser=qwen3, speculative_config={"method":"mtp","num_speculative_tokens":2}. Both swept with vllm bench serve --dataset-name random --ignore-eos. Hardware: UpCloud H100 80GB SXM, single GPU, fi-hel2. Strix Halo numbers from my own BENCHMARKS.md, best of ROCm pr21344 / Vulkan RADV / ROCm 7.2.3 backends.

Live steering breaks deep focus: notes from three failed pair-coding sessions

I spent half of yesterday watching a pair-coding setup fail at the same task three times in a row. The setup was the one I had been quietly proud of: a developer agent and an adversarial-developer agent running concurrently, with the adversary observing the developer's live stream and able to interrupt mid-task whenever it spotted a problem. The premise felt obvious. Why wait for a bad diff when you can catch the mistake while it is being made?

In theory, an attractive idea. In practice, after many rounds of tweaking the prompts, debouncing rules, and interrupt semantics, the result has been the same every time: a confused developer agent that produces no valuable output. Yesterday was the cleanest example I have. A 16-minute session that burned 13.68M tokens and ended with a working tree full of .bak files, plus two follow-up sessions that produced zero code edits across 38 combined developer turns. The pattern was clean enough that I went back to read what the rest of the field has been doing with coder + critic agent pairs. The answer was unkind: the architecture I had built is exactly the one the production literature has been moving away from for the past year. This is the story of what went wrong, why, and what I should have been doing instead, which is the same thing I have already been doing in production for about a year.

The setup

The system is pi-ensemble, an extension I maintain that wraps the Pi terminal coding agent and turns the parent process into a project manager dispatching role-specialised child processes. One of those tools is pair_watch. It spawns a developer child and an adversarial-developer child simultaneously, summarises each developer turn into ≤500 characters of tool-call descriptions and message excerpts, and pipes that summary into the adversary as a steering input. The adversary can then call interrupt_developer whenever it sees something concerning. That call is injected back into the developer's next turn as a user-message prefixed [pair:adversarial].

The developer's system prompt tells it to "read the interrupt before your next action" and "adjust your plan." The adversary's prompt says, almost verbatim, "Restraint is false economy. If you can predict a failure mode the dev hasn't addressed, interrupt." You can probably guess where this is going.

The task that broke it

The work itself was a six-pass code review verdict on a real PR with three remaining MEDIUM findings. Two were architectural (a chat SSE contract that had started carrying research-specific event types, and a deep_research function that had grown a direct coupling to the streaming transport) and one was error-handling (an unknown-status branch that returned an error without emitting a terminal Failed event to the UI). The first two findings touched signatures in a Rust file using async tokio channels, trait objects, and lifetimes. If you have done that kind of refactor you already know what is coming: it is not difficult conceptually, but it requires holding several constraint sets in your head at once. Lifetimes that satisfy one call site need to also satisfy the other. The compile-error feedback loop is slow and the errors are walls of text.

I dispatched pair_watch. Sixteen minutes later it came back with a verdict of ESCALATED and a working tree in worse shape than before.

What the transcripts showed

I pulled the developer and adversarial JSON transcripts and walked through them turn by turn. Two distinct failure modes appeared, and they are worth separating because they have different causes.

Session 1 was the architecturally pure failure. The developer made real progress on the two simpler findings within the first four minutes. Then it hit the trait-object refactor. The Rust file oscillated in size, 671 → 854 → 622 → 671 lines, as the developer started a structural edit, was interrupted by the adversary mid-edit ("the old function body wasn't fully removed; the file has duplicate function definitions"), abandoned the partial work to address the interrupt, started a different approach, was interrupted again ("the file is now 866 lines, over the 500-line limit"), and so on. By the end the file contained the old function body, a half-finished make_progress_callback replacement, and an uncompilable type alias. The developer re-read the same file 35 times across 112 turns, about 31% of all turns. Input-to-output token ratio was 4.4:1, which is what you would expect when context is being rebuilt after every interruption rather than flowing forward.

What is important about session 1 is that every adversarial interrupt was correct. The duplicate code really was there. The file size really was over budget. The signature mismatch really did exist. The interrupts were good signal arriving at the wrong moment. Each one forced the developer to leave a half-finished structural edit and respond, which is exactly how you end up with three half-finished structural edits laid on top of each other.

Sessions 2 and 3 were a different failure, and I want to be careful to distinguish them. After the first session left a messy partial state, I asked the system to recover. This time the developer made zero edits across 38 turns: all colgrep, ctx7 documentation lookups, file-line counts, and re-reads. The adversary interrupted nine times across the two sessions, each interrupt becoming progressively more pointed ("Please stop searching and implement"). The developer did not implement.

This second mode is tempting to blame on the interrupts but I do not think that is quite right. The developer was already in an avoidance loop before the interrupts started. The working tree was incoherent, two files were mid-refactor in incompatible ways, and the model in use (a smaller one I had switched to mid-recovery) did not want to commit to an edit it might have to undo. The interrupts were trying to push the developer out of the loop, not into it. The second-order point is the one that matters: session 1's interrupt-driven thrash created the broken state that sessions 2 and 3 could not recover from. The live-steering mechanism produced an artifact, a half-edited working tree, that subsequent runs inherited as input.

Why the architecture is biased this way

I went back and read my own code, which is always a humbling exercise. The mechanism is doing exactly what it was designed to do. The design is the problem.

Three things compound. First, each interrupt is injected as a user-turn in the developer's context. There is no debouncing, no minimum gap. If the adversary sees a problem after every developer turn, the developer gets an interrupt after every turn. Second, the developer's prompt explicitly instructs it to re-plan on interrupt. So the developer treats every interrupt as a signal to pivot, not as a note to file. Third, the adversary's prompt is biased toward firing: "restraint is false economy." Combine these and you have a system that, by construction, prevents the developer from sustaining a multi-turn structural edit.

For tasks that decompose into independent steps (a small bug fix, an incremental refactor, a feature with a clear scaffolding) this is fine and probably helpful. The pivot cost is small and the catch is valuable. For tasks where the steps do not decompose, anything where you have to hold N constraints simultaneously and resolve them with a single coherent edit, every pivot is a partial-write that has to be unwound or merged. The pivot cost dominates the catch value.

What the field already knew

After enough self-flagellation I went looking for who else had tried this. The literature is more developed than I expected and the convergence is striking.

The dominant pattern in production multi-agent work is generate-then-critique with a debate loop. The MASQRAD paper from early 2025 is representative (the domain is data visualization queries rather than code, but the mechanism transfers): an actor LLM produces the full artifact, then a critic LLM enters a multi-agent debate to refine it. The critic does not interrupt generation. This is, almost exactly, the legacy developer → adversarial_loop flow that pair_watch was meant to replace.

A more recent paper, MASDP in IEEE TSE (Jan 2026), is the one that hits closest to home. The authors explicitly identify the two failure modes that emerge when critic-style agents try to steer coder agents: "the difficulty in accurately interpreting complex role prompts" and "the fragility of inter-agent coordination." They propose a different architecture entirely. Rather than a reactive critic, they fine-tune an anticipatory "Reminder" agent that frontloads likely-failure warnings into the coder's initial prompt and iteratively refines those precautions based on execution feedback. The result outperforms GPT-4 baselines while reducing compute, on small LLMs. The phrasing in the paper is striking: they describe their contribution as shifting optimisation burden off the programmer agent. That is the inverse of what live steering does, which is to add burden. Every interrupt is a new context to integrate.

A2C-LLM (MDPI Drones, May 2026) goes further in a different domain (UAV swarm task allocation): replace the LLM critic with a lightweight value-function head. The critic produces a scalar advantage signal, not English-language commentary. This is cheaper, more stable, and does not pollute the actor's context window. The principle generalises beyond drones, but it is only feasible when you have something to optimise toward, which for general code work you usually do not.

On the framework level, the divergence is now explicit. AutoGen, the conversational turn-by-turn steering framework, is widely characterised as a research and brainstorming tool with a steeper learning curve and higher debugging difficulty that makes it ill-suited for production. LangGraph has become the production default specifically because it is deterministic, checkpoints state, and exposes observability. The community has been making this call out loud for at least a year, and I missed it because the pair-coding metaphor was too seductive.

The unifying lesson across all of these is the same: anticipatory critique beats reactive interruption. Tell the coder what to watch for before it starts. Review the result after it finishes. Do not talk to it while it is thinking.

What I am taking from this

Three honest things.

The first is the empirical one. For a year I have been running the turn-based pattern: the developer agent works in one context, the adversarial reviewer runs after the diff is produced in a separate context, and the verdict feeds either a fix loop or a merge. That setup works. It has been the backbone of every serious piece of agentic work I have shipped. Pair-watch was an attempt to improve on it by moving the critique earlier, and across many rounds of tuning it has not produced a single result that the turn-based pattern would not have produced more cheaply and more reliably. The architecture is the problem, not the prompts.

The second is the meta-point, which is the one I keep returning to. The reason pair_watch keeps being seductive is that "two agents working together in real time" sounds intuitively better than "one agent works and then another reviews." It pattern-matches to how humans pair-program well. But LLMs are not humans. They do not have the working memory to hold a multi-constraint problem across an interruption and pick it up where they left off. They re-plan from scratch every turn, and re-planning from scratch with one more user-turn worth of context is worse, not better, than re-planning from scratch with the original problem statement. The thing that makes human pair programming work, shared continuous state, is the thing LLMs structurally do not have. No amount of prompt tweaking changes that.

The third is what I am actually going to do. Pair-watch is gone from the default workflow. The default returns to where it has been for a year: generate, then adversarial-review, then six-pass lens review. The mechanism stays in the repo because there are tasks where the pivot cost is low and an early "stop, you're about to do X" is cheap insurance: linear bug fixes, walking through a series of small independent edits, anything where the plan does not need to bend. For those, opt-in. For everything else, the live channel between the two agents is a tax on the developer's attention.

The next experiment is the MASDP pattern from the IEEE TSE paper: front-load the adversary's predictable warnings into the developer's initial prompt as anticipatory constraints and let the developer work through them without interruption. Whether that improves on plain post-hoc review I genuinely do not know yet. At least now I know what hypothesis I am testing, and which one has already failed.


If you maintain a multi-agent system that does live steering and you have seen it converge on hard problems, I would like to hear about it. Particularly the wall-clock-per-task numbers and what kinds of tasks it handles well. The literature I found was largely against the pattern. The counterexamples might be more interesting than the consensus.

pi-ensemble: agentic coding that optimizes for quality, not speed

I released pi-ensemble yesterday. It is alpha. The pattern it codifies is not.

For more than a year I have been running a multi-agent setup against my forked opencode: a project manager orchestrating specialist children, a mandatory adversarial gate before commits, a six-lens code review before merges. The combination is slower than letting a single agent rip through tickets, and it is not subtle about it. That is the whole point. pi-ensemble is the same workflow rebuilt as a clean extension to Pi, Mario Zechner's terminal coding agent. Same philosophy, fewer hacks.

The thing most agentic coding setups get wrong

The default optimization target for an AI coding agent right now is velocity. Lines of code per hour. Tickets closed per day. Time from prompt to PR. The frameworks lean into it: parallel workers, autonomous loops, "ship it" rhetoric. The metric is throughput.

This is the wrong metric if you care about whether the code is correct.

Codebases age. The cost of a bug found in review is a fraction of the cost of the same bug found in production. The cost of a security issue found by an adversarial pass is a fraction of the cost of the same issue found by Snyk in your dependency tree. The cost of a poorly-typed signature surfaced before merge is a fraction of the cost of refactoring around it three months later. Speed at the input does not save time at the output. It just shifts where you pay.

pi-ensemble is built on the opposite assumption: that the right number of agents in a workflow is "however many it takes to find what is wrong before you commit it."

The architecture

The parent pi you launch is the project manager. When you run a slash command, the extension injects PM doctrine into the system prompt for that turn. The PM then dispatches specialist children: each one is a separate Pi process spawned with pi --mode json -p --no-extensions --no-session --append-system-prompt <role.md>, with its own assembled prompt and its own context window.

Six roles ship:

RolePurpose
project-managerOrchestrates. Holds the workflow state.
developerImplements. Writes the code.
opsCommits, branches, PRs. Touches git.
exploreResearch. Web, codebase, prior memory.
adversarial-developerTries to break what the developer just wrote.
code-review-specialistOne of six lenses applied to a finished PR.

Children do not share context with the PM. They report back through structured tool calls. This keeps each specialist's context small and focused, and prevents the slow context contamination that single-agent workflows accumulate over a long session.

Five commands

/start          Initialise session: memory, codebase index, git/PR/CI state
/research       Fan out explore specialists in parallel
/plan           Draft a GitHub issue, classify, apply template
/work           Run an issue end-to-end
/review         On-demand six-pass review of a PR or path

The interesting one is /work. Hand it an issue number and it runs the full pipeline: feature branch, optional parallel worktrees, developer dispatches to implement, mandatory adversarial gate, ops commits, PR, six-pass code review, CI watch, merge per AGENTS.md policy. The two gates are not optional. They are the reason the setup exists.

Two gates

The adversarial gate runs before every commit. An adversarial-developer child receives the diff and the implementation context and is given one job: find what is wrong. Edge cases, missing error handling, off-by-one errors, security implications, behavioural assumptions that do not hold. If it finds something, the developer gets up to three rounds of fixes. Only then does the commit happen.

This catches a class of bugs that single-agent setups miss systematically. A single agent that wrote the code is the wrong agent to evaluate it: it carries the same assumptions, the same blind spots, the same confidence about what should work. An adversarial child with a different system prompt and no context contamination finds things the writer cannot see.

The six-pass code review runs before merge. Six children, each pinned to one lens:

LensLooks for
SecurityAuth holes, injection, secret handling, permission boundaries
Error handlingUnhandled paths, silent failures, recovery behaviour
Type safetyCoercions, nullability, invariant violations
PerformanceHot paths, allocations, N+1 patterns, sync-in-async
ArchitectureCoupling, dependency direction, separation of concerns
SimplicityCode that exists but does not need to

Findings come back as schema-validated report_finding tool calls. They get deduplicated by (path, line, title), precedence-merged so the highest severity for a given location wins, and turned into a verdict: APPROVED, ISSUES_FOUND, or CRITICAL_ISSUES_FOUND. The merge does not happen on a critical verdict without explicit override.

Six lenses are not arbitrary. They are the categories that I have seen single-agent reviews most consistently miss, distilled from a year of opencode runs and several thousand findings logged in the corresponding skill files.

What is underneath

pi-ensemble does not stand alone. It assumes a stack:

  • vipune for cross-session memory. Every agent calls it.
  • oo for context-efficient wrapping of chatty CLIs like git and gh. Without this the specialist windows fill with noise.
  • colgrep for semantic code search. Used to find existing implementations before writing new ones.
  • parallel-cli for web search and deep research. The explore role expects it.

This is the visible top of a longer-running effort. Each of these tools exists because something in the workflow needed it and the existing options were not good enough.

Per-role models

You almost certainly want a smarter model for the PM and a faster one for the specialists. pi-ensemble has a 5-layer resolution for subagent models, from per-call override down to a global default. Run /ensemble-model to pick interactively from whichever providers you have authenticated through Pi's /login (Anthropic, OpenAI, GitHub Copilot, Cerebras, whatever).

A typical configuration: Opus or Sonnet on the PM, fast Cerebras models on the lens reviewers, Sonnet on the developer and adversarial-developer. The cost arithmetic is roughly six children × two-thousand-token outputs plus context, per review cycle, which lands around $0.02 to $0.10 on Cerebras and considerably more on Anthropic. That is the tax for the quality gate.

Why Pi, not opencode

I have been running this same orchestrator pattern against my opencode fork for over a year. The maintenance cost of that fork is what eventually pushed me to migrate.

Opencode is a great piece of software. It is also a much larger surface area than I needed: a TUI, a web UI, a desktop wrapper, a server, a plugin layer, multiple packages. Every upstream change required reconciling against my own modifications, every dependency bump touched something I had to retest. The deeper I got into customising the orchestration layer, the more time I spent on integration rather than on the workflow itself.

Pi sits in a different place. The harness is lightweight: a single binary, a small system prompt that ships from the harness rather than being layered into your config, an extension model that is straightforward to develop against. There are no async tasks in either harness yet (this is the one thing I miss most), but Pi feels faster and gets out of the way more readily.

The ecosystem dynamics are also different. Anomaly has a real business model around opencode involving their Zen LLM API and hosted services, and the contribution rules reflect that. The current CONTRIBUTING.md is more open than it used to be (the old policy was effectively "no feature PRs from outside the core team"), but UI and core product changes still require a design review with the core team before implementation. For a fork that needs to evolve at the pace of my own experimentation, that is the wrong governance model. Not wrong in absolute terms; wrong for what I am trying to do.

Pi's extension story lets me ship the orchestration layer cleanly without forking the harness at all. pi-ensemble is an extension. So is pi-worktree. So is pi-permissions. Each is a separate concern, each can move on its own schedule, none of them require me to maintain a fork of the underlying tool. That is the part I could not get with opencode without spending most of my evenings on rebase.

If opencode is the right tool for you, keep using it. The work I did there is what made pi-ensemble possible. But the maintenance arithmetic stopped working for me, and the migration has paid for itself already.

The honest caveats

This is alpha. Things will change before 1.0. Specifically:

Permissions are not enforced per role. Specialists inherit Pi's default permissions. The role system prompt is the only thing keeping each in its lane. This is acceptable on a sandbox repo and not acceptable on anything you care about. pi-permissions will fix this.

The six-pass review costs real money. Not enough to matter on personal projects. Enough to matter if you fire it indiscriminately across a hundred PRs a day. Pin the cheap models for the lens reviewers; reserve the expensive ones for the PM.

Worktrees go through git CLI calls. They work, but the safer path is the pi-worktree plugin when its programmatic API stabilises.

It is tested on macOS. Linux should work. Windows almost certainly does not.

This is not for everyone. If your team measures developer output in story points and your AI integration is supposed to make those story points cheaper, pi-ensemble is the wrong tool. It will spend more tokens, take longer, and produce fewer PRs per day. What it produces will be better-reviewed, more defensible, and less likely to bite you later. That is a trade you have to actively want.

Why this exists

A year of running this pattern in opencode taught me that the bottleneck in agentic coding is not the model. It is the discipline applied to what the model produces. Models will happily ship reasonable-looking code that is wrong in non-obvious ways. The question is whether your workflow gives that code a serious chance of being caught.

pi-ensemble is one answer to that question. The interfaces will change. The philosophy will not.

Repo: github.com/randomm/pi-ensemble

Skills, not features: notes on a methodology that works for one person at a time

This post is overdue. I shipped /add-parallel into NanoClaw in early February. The merge happened the same day I opened the PR. I have been thinking about that experience ever since, not because the PR itself was complicated, but because the contribution model was not normal. Three months on, the pattern has a name (SkDD), a popular framework on top of it (Superpowers, around 95K stars), and a security audit calling it out as a supply chain risk (Snyk's ToxicSkills, 36% of audited skills flagged). It is time to write down what I actually think.

The PR

NanoClaw is Gavriel Cohen's container-isolated alternative to OpenClaw. Around 500 lines of TypeScript on trunk, agents run in actual Linux containers, security is enforced by the OS rather than by application-level allowlists. The contributing rule is exactly five words: don't add features, add skills.

So when I wanted to wire up Parallel AI as an integration, I did not add code to trunk. I wrote a SKILL.md that teaches Claude Code how to transform a NanoClaw fork to include Parallel. A user runs /add-parallel in their own checkout, Claude reads the skill, modifies the local code, and the integration is in place. Trunk never sees the diff. My PR added one skill file and a registration entry. That was it.

The strange part was not the code. The strange part was the conceptual flip. The PR did not extend NanoClaw. It added a recipe that lets each user extend their own copy of NanoClaw differently. That is the whole skills-driven development idea in one example.

The methodology, briefly

The pattern was formalized by Zak El Fassi in March under the name SkDD. Every build loop adds one decision gate: should this become a skill? If yes, you write a SKILL.md. The agent finds it, loads it, runs it next time. Three skill types: operational (do discrete work), meta (create other skills), composed (chain skills into pipelines). The compounding happens over months.

Jesse Vincent's Superpowers is the most-adopted implementation for individual development workflows. It bakes TDD, brainstorming, code review, and subagent-driven implementation into a skills library that works across Claude Code, Codex, Cursor, OpenCode, and Gemini CLI. The format converged organically rather than being formally standardized.

NanoClaw is the most opinionated project-level application of the philosophy I have seen. Other projects use skills as a developer's personal productivity layer. NanoClaw uses skills as a contribution model. The codebase deliberately stays minimal because the extensibility lives in the skills branch.

What it gets right

The trunk stays auditable. This is the security argument, and it is real. NanoClaw is small enough to read in an evening. You add the channel, agent provider, or integration you need; you do not inherit the security surface of fifty modules other users wanted. Compare to OpenClaw at ~400K lines: nobody is reviewing that codebase end to end.

The compounding is real for personal repos. I write a skill once. Future-Claude finds it and uses it. The skill survives session boundaries, model swaps, and harness switches. Three months in, I have skills that I no longer consciously remember writing but that get invoked automatically. That is genuine compounding.

It is harness-agnostic. A SKILL.md is just markdown. Claude Code, Codex, Cursor, OpenCode all read it. You are not locking yourself into a vendor's plugin system. This is the most underrated property of the format.

It forces small, composable units. A skill that tries to do too much is hard to write and unreliable to invoke. The format pushes you toward single-responsibility units, which is the same discipline that makes good Unix tools.

Where it breaks

This is the part I do not see written down enough.

Heterogeneous state is the default outcome. When every user runs different /add-* commands against their own fork, no two installations are the same. For a personal AI assistant, that is the point. For a team product, it is a disaster. You cannot debug a deployment when "the codebase" is a hypothesis rather than a fact. You cannot do incident response when the production environment is a snapshot of one developer's skill choices six months ago.

Skills-driven development does not generalize to team production code. It works for personal projects, dev tooling, and individual workflows. It does not work for software multiple humans need to reason about together. The whole point of trunk-based development, code review, and shared conventions is to keep the team's mental model of the system in sync. Skills-driven development inverts that: each fork drifts intentionally.

The output depends on the LLM. Two users running the same /add-telegram skill with Claude Opus 4.7 and GPT-5.2 get different code. Sometimes meaningfully different. The skill is a prompt, not an executable. The result is what the model decides to do with that prompt in that context. For a deterministic build, this is unacceptable. For a personal assistant that gets close enough to what you wanted, it is fine. Know which one you are building.

Skill quality is mostly invisible to the user. Reading a SKILL.md does not tell you what the agent will actually do. The instructions look reasonable. The output may not be. You find out by running it, which is fine on a personal fork and dangerous when the skill touches credentials, deploys code, or modifies shared infrastructure.

The security context is worse than the methodology suggests

Snyk's ToxicSkills audit (February 2026) scanned 3,984 skills from ClawHub and skills.sh. 36.82% had at least one security issue. 13.4% had critical issues including malware, credential theft, and prompt injection. There is no package-signing standard. There is no central review. The format converged faster than the supply chain hygiene around it did.

The lesson is not that skills are bad. The lesson is that "just install this skill" should carry the same suspicion as "just run this curl pipe to bash". Read the skill before you run it. If you would not paste the contents of the SKILL.md into your terminal manually, do not let an agent do it for you.

Where this actually fits

Skills-driven development is a good fit for personal AI assistants (NanoClaw is the proof), individual developer workflows (Superpowers is the proof), and exploratory tooling where each user's needs diverge by design. The compounding is real and the trunk-minimalism is genuinely useful.

It is not a fit for team production code, regulated environments, or anywhere multiple humans need a shared understanding of what the system does. Determinism matters in those contexts, and skills-driven development trades determinism for compounding personal capability. That is a fine trade for one person. It is a bad trade for an organization.

The reason this is worth being explicit about is that the writeups currently in circulation treat SkDD as a general-purpose development methodology. It is not. It is a specific tool for a specific class of problem. For that class, it works well. For everything else, it produces faster chaos.

/add-parallel was the right way to contribute Parallel AI support to NanoClaw. It would have been the wrong way to add Stripe support to a company's billing service. The methodology is not universal. Knowing which side of that line you are on is the part the hype is glossing over.

SkDD methodology | Superpowers | NanoClaw | Snyk ToxicSkills audit

vipune 0.5: agent memory without the agent

There is no shortage of agent memory systems. mem0, Letta, Zep, Voltropy's LCM, Claude Code's built-in memory, Cursor's context engine. My colleague Topi's Remind is a recent addition to the space. The common pattern across most of these is LLM-based distillation: raw data goes in, an LLM extracts higher-value memories, generalizations, or concepts. mem0 has been doing this for a while. Remind pushes it further with spreading activation retrieval, entity graphs, and outcome tracking, but the core technique is the same.

vipune does not use an LLM. That is a deliberate choice. You store text, it gets embedded locally (ONNX, bge-small-en-v1.5), and you search by meaning. No distillation step, no external model calls, no token cost for memory operations. The tradeoff is obvious: you do not get automatic generalization. What you get is a single binary with no dependencies that runs the same everywhere, offline, with zero configuration.

The 0.5 release adds CLI flags that make this simpler tool useful for things the larger systems were not designed for: scoped multi-agent memory within a single session, typed retrieval, and recency-weighted search that lets you use the same store as both long-term knowledge and short-term working memory.

(I have been toying with the idea of adding an optional distillation step via apfel, the CLI that exposes Apple Intelligence's on-device model on macOS 26+. A local 3B model with no API keys could handle memory consolidation without breaking vipune's zero-dependency, zero-cost model. I have been experimenting with apfel for other things and the on-device inference is fast enough to be practical. But that is future work, not a promise.)

Before getting into the new flags, it is worth looking at one of the more interesting architectural approaches in this space.

What Volt LCM gets right

Voltropy's Volt introduced Lossless Context Management earlier this year. The LCM paper is worth reading. The core insight: stop asking the model to manage its own memory and let the engine do it deterministically. LCM maintains a DAG of hierarchical summaries in a persistent store. Compaction happens asynchronously between turns. Nothing is lost. Sessions can run indefinitely.

The approach is sound. Volt performs well on long-horizon tasks because the model does not have to invent a memory strategy on the fly.

The limitation is structural. Volt is a complete terminal-based coding agent, forked from OpenCode. You get LCM, but you also get the entire agent runtime. If you are building your own harness, or running Claude Code, or using Cursor, Volt's memory is not something you can pull out and use separately.

vipune: just the memory

vipune is the memory layer without the agent. Single binary. No API keys, no daemon, no database server. Everything runs locally using ONNX embeddings (bge-small-en-v1.5). Install it, start using it.

cargo install vipune

Or grab the binary directly:

curl --proto '=https' --tlsv1.2 -LsSf \
  https://github.com/randomm/vipune/releases/download/v0.5.0/vipune-installer.sh | sh

With 0.5, the feature set covers the main things needed for multi-agent memory workflows. Three capabilities in particular.

Multi-agent scoping

Multiple agents can share a single vipune instance. Inside a git repo, vipune infers the project scope from the repository, so agents working in the same repo share memory by default:

# Both agents are in the same repo — memories are shared automatically
vipune add "Auth service uses JWT with RSA-256"
vipune search "authentication flow"

When you need isolation between agents in the same session, --project overrides the default:

# Agent A: backend scope
vipune --project "myapp/backend" add "Auth service uses JWT with RSA-256"

# Agent B: frontend scope
vipune --project "myapp/frontend" add "Token refresh handled in useAuth hook"

Outside a git repo, set VIPUNE_PROJECT to scope manually. Either way, each agent gets its own memory namespace, or shares one deliberately.

Typed memories

Not all memories are the same. A design decision is not the same thing as a guardrail. vipune now has five memory types:

TypePurpose
factDefault. Statements about the world.
preferenceHow the user or system prefers things done.
procedureStep-by-step processes.
guardThings that must not happen.
observationTransient notes, intermediate findings.
vipune add "Never deploy to prod on Fridays" --memory-type guard
vipune add "Run migrations before schema tests" --memory-type procedure
vipune search "deployment rules" --memory-type guard,procedure

Agents can filter by type on search. A coding agent looking for guardrails does not need to wade through every factual observation from the last three days. This keeps search results relevant as the memory store grows.

The --status flag adds another axis. Memories start as active or candidate. The --supersedes flag atomically replaces an old memory with a new one:

vipune add "Alice now works at Google" --supersedes abc123-old-memory-id

One transaction. The old memory is marked superseded, the new one is active. No window where both are live.

Tunable recency

Search results combine semantic similarity with a time decay score:

score = (1 - recency_weight) * similarity + recency_weight * time_score

The default is 70% semantic, 30% recency. For long-term project memory, that balance works. For a single programming session where you want recent context to dominate:

vipune search "what did I just decide about the API" --recency 0.8

Now 80% of the ranking comes from how recent the memory is. This is what turns vipune from a knowledge base into working memory. Same binary, same data, different search behaviour depending on what you need right now.

For full-text matching, --hybrid enables BM25 alongside semantic search:

VIPUNE_HYBRID=true vipune search "JWT RSA-256"

Useful when you need exact keyword hits, not just meaning.

MCP server

vipune runs as an MCP server out of the box:

vipune mcp

This exposes store_memory, search_memories, list_memories, and supersede_memory as native tools. Claude Code, Cursor, and anything else that speaks MCP can use vipune as a memory provider without shell command wrappers. The MCP tools accept the same type, status, and filter parameters as the CLI.

Where this fits

The bigger memory systems do more than vipune. Letta has structured agents with memory tiers. Volt LCM has hierarchical DAG compaction. mem0 has managed cloud infrastructure. If you need those things, use those tools.

The niche vipune occupies is narrower. You have an agent (or several), you want them to remember things across turns or sessions, and you do not want to add a service, a daemon, or an account to make that happen. You want a binary you can call from a shell or expose over MCP. The --project, --memory-type, and --recency flags are what make that narrow niche practical for real workflows instead of just toy examples.

# In your CLAUDE.md or agent instructions:
# Use `vipune search` before starting work to check for relevant context.
# Use `vipune add` to store decisions, discoveries, and guardrails.
# Use `vipune add --memory-type guard` for things that must not be forgotten.

That is the whole integration.

GitHub | crates.io | CLI Reference

New Strix Halo? Five things that will cost you hours.

The AMD Ryzen AI MAX+ 395 — Strix Halo — is a genuinely interesting machine for running large language models locally. 128 GB of unified memory, real world reported ~225 GB/s memory bandwidth, an iGPU that can access the full pool. You can run a 35B MoE model fully on-GPU at 40–55 t/s. No cloud, no API bill, no data leaving the machine.

Getting there takes some work. The hardware is capable but the software stack has sharp edges. Here are the five things that cost the most time on a fresh setup, and how to fix them. For the full step-by-step guide, see the Strix Halo setup reference.


1. You are only getting 61 GB of GPU memory

Out of the box, ROCm caps GPU memory allocation to ~61 GB on APUs. The machine has 128 GB. Without the right kernel parameters, roughly half of it is inaccessible to your models.

Add these to GRUB_CMDLINE_LINUX in /etc/default/grub:

iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856 amdgpu.no_system_mem_limit=1 amdgpu.cwsr_enable=0

Then:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot

Verify afterwards:

echo "scale=1; $(cat /sys/class/drm/card*/device/mem_info_gtt_total | head -1) / 1024^3" | bc
# Should print: 124.0

The key parameter is amdgpu.no_system_mem_limit=1. Without it, the kernel silently caps allocations regardless of what you configure elsewhere.


2. Your GPU is running at 600 MHz

The GPU clocks down to ~600 MHz when idle and does not boost appropriately during inference unless you intervene. The result is roughly 50% of the throughput you should be getting, with no error message — the model just runs slow.

echo "high" | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level

This does not persist across reboots. Create a systemd service to set it on startup — the full guide has the unit file.


3. Two llama-server flags are non-negotiable

On Strix Halo, these two flags are mandatory for every llama-server invocation:

-fa 1 --no-mmap

-fa 1 enables flash attention. Without it, inference crashes. --no-mmap disables memory mapping, which causes silent data corruption or hangs on Strix Halo unified memory. Neither produces a helpful error when missing — you just get crashes or wrong outputs.

Always include both. Put them in whatever launch script or config you use so they cannot be accidentally omitted.


4. Toolbox from SSH silently fails

If you are running llama-server from an SSH session or a systemd service via a Podman toolbox container, you will hit this:

crun: sd-bus call: Access denied

The cause: Podman defaults to the systemd cgroup manager, which requires polkit interactive auth. On a headless server, that auth never happens.

Fix:

mkdir -p ~/.config/containers
cat >> ~/.config/containers/containers.conf << 'EOF'
[engine]
cgroup_manager = "cgroupfs"
EOF

Test it:

toolbox run echo "cgroup test OK"

5. ROCm containers do not release GPU memory between model swaps

If you are using llama-swap to hot-swap between models, be aware that ROCm/HIP retains the GPU memory pool within a container namespace after the llama-server process exits. Vulkan/RADV releases memory immediately. ROCm does not.

The consequence: if two different ROCm models share the same container, the second model will fail to allocate GPU memory after the first has been evicted. The fix is to create a separate container for each ROCm model, even if they use the same image:

toolbox create --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7.2.1-pr21344 llama-rocm-pr21344
toolbox create --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7.2.1-pr21344 llama-rocm-pr21344-b

Then assign each model config its own TOOLBOX_CONTAINER. Vulkan models can share a container without issue.


Once you have cleared these, the machine runs well. Pre-built containers for gfx1151 are at kyuz0/amd-strix-halo-toolboxes and cover the three backend variants you will need. The full setup guide covers everything from BIOS to model downloads.

One to watch: TurboQuant KV cache compression

TurboQuant (Zandieh et al., ICLR 2026) promises significant KV cache compression — the current implementation in TheTom/llama-cpp-turboquant achieves 5.12x vs FP16 with block_size=128, and a recent sparse V dequantization optimization gives an additional +22.8% decode speed at 32K context. On a 128 GB machine that translates to meaningfully more usable context before you hit memory limits.

HIP/ROCm support is confirmed in TheTom's fork, so it is usable today on Strix Halo. The flag syntax is --cache-type-k turbo3 --cache-type-v turbo3. Note this differs from the standard -ctk/-ctv flags.

It is not merged into llama.cpp mainline yet. The main discussion is at #20969 (162 comments, very active) and the feature request at #20977. No formal upstream PR has been submitted yet. That merge is the trigger to watch — when it lands, mainline llama.cpp gets it and the kyuz0 containers will pick it up on the next rebuild.

You feeling the squeeze yet?

The era of predictable AI costs is over. Flat-rate subscriptions are disappearing from enterprise tiers, usage-based billing is replacing them, and the per-token rates underneath are going up. If your team is doing anything agentic, this is not an abstract pricing discussion.

The numbers

OpenAI moved first. GPT-5.2 is up 40% across the board:

OldNew
Input$1.25/M tokens$1.75/M tokens
Output$10.00/M tokens$14.00/M tokens

Anthropic's April enterprise changes are sharper. Before: $200/user/month, flat, with discounted tokens included. After: $20/seat base plus per-token billing at full API rates plus a monthly spending commitment. For some customers that is a 3x cost increase, not a 40% one.

There is also a quieter change. Claude Opus 4.7 shipped a new tokenizer that produces 1.0 to 1.35 times more tokens for the same input text. The published rate card did not change. The effective cost went up roughly 40% anyway.

Why agentic workflows feel this differently

A developer chatting with an LLM uses tokens in bursts. An agent running in a loop uses them continuously, across multiple subagents, with long context windows, on every turn. Claude Code weekly active users doubled between January and February 2026 alone. That kind of growth is exactly why providers are raising prices, and it is exactly why those raises hit agentic teams hardest.

Average Codex cost is now being quoted at $100 to $200 per developer per month. Multiply that across a team of 50 and you have a real budget line, not a rounding error.

What changes because of this

Budget predictability is gone. Monthly spending commitments create a new category of financial risk that your infrastructure team is probably not set up to manage yet. API billing is flexible but has no ceiling.

Single-model strategies are becoming expensive bets. The cost spread between frontier models and capable cheaper alternatives has widened enough to make routing worth the engineering effort. xAI Grok 4.1 at $0.20/M input and $0.50/M output exists. Haiku 4.5 exists. Not every task in your agent pipeline needs the most expensive model.

Prompt caching and batch processing stopped being optional optimizations around the same time these price increases landed. Agentic loops that minimize round-trips are no longer just good practice. They are financially material.

Open weight models are a real option

For programmer and devops agents, Minimax M2.7 is capable enough to carry the workload. For orchestration, GLM-5.1 and Qwen3.6 (the plus variant) are worth a serious look. For teams with full air-gap requirements, Qwen3.6 MoE self-hosted removes the API dependency entirely. No token bills, no data leaving your infrastructure, no rate limits.

The tradeoff is operational: you are running infrastructure, managing updates, owning the uptime. For organisations facing six-figure annual API commitments, that tradeoff is starting to look different than it did twelve months ago.

The structural point

These increases are not a correction back to some future lower price. The compute demand from agentic development is real and growing, providers are constrained, and they are pricing accordingly. The organisations that treat AI infrastructure costs as a first-class architectural concern now will be in a better position than the ones that treat it as a line item to revisit later.

So: are you feeling the squeeze yet? If not, you probably will be.

Feynman: a research agent worth the rough edges

Two weeks in with Feynman, the open source research agent built on the Pi runtime. The short version: rough around the edges, genuinely useful.

What it does

Feynman is a CLI tool. You ask a research question, it dispatches four parallel agents: a researcher pulling from papers and the web, a reviewer running simulated peer critique, a writer producing structured output, and a verifier checking every citation and killing dead links. Every claim in the output links back to a source. That last part matters more than it sounds.

The other standout feature is feynman audit <arxiv-id>: give it a paper and it compares the stated claims against the actual public codebase. How often does published research actually match the code? Turns out, not always.

How I use it

Two patterns have stuck.

The first is gut-feeling verification. I work in software and AI. You accumulate opinions fast, and not all of them survive contact with the literature. Before I put something in writing or stake a position in a conversation, I'll run it through Feynman. Sometimes it confirms what I thought. Sometimes it finds a paper that reframes the question entirely. Either way I come out better informed than I went in.

The second is writing support. When I am drafting something and need a citation I am typically too lazy and go with gut feeling. Feynman changes that for someone like me. Now I ask Feynman. The verifier agent means I am not getting hallucinated references. That is a real change in how much I trust the tool versus how much I trust my own search habits.

The rough edges

It is a young project. The main friction I hit: subagents do not automatically inherit the LLM provider of the main agent. If you are running against anything other than Anthropic, subagents silently fall back, hit a missing API key, and fail. You do not always notice immediately.

Two things cause this. First, Pi does not expand environment variables in agent frontmatter. If an agent file says model: ${OPENAI_MODEL}, Pi reads that as the literal string, not your configured provider. Second, Pi-subagents only looks for agent definitions in specific paths (.agents/ or .pi/agents/). If agents end up anywhere else, Pi falls back to its own bundled definitions, which hardcode anthropic/claude-sonnet-4-6. That is where the missing API key error comes from, even when you have a different provider set up correctly.

The fix is to expand env vars at bootstrap time, before agent files are written to the path Pi actually searches. The file lands with the model string already baked in, not a placeholder. Feynman does not do this yet out of the box.

When subagents fail, Feynman degrades to the main agent or whichever agent has a working pipeline. The system prompt may not be optimised for the task, but the output is still useful. I have a fork that wires this up properly and adds parallel-cli as the search backend. If you run into the same issue, it might save you some time.

Worth it

The project has the right idea about what matters in a research tool: source-grounded output, not plausible-sounding summaries. For a two week old piece of software it is further along than I expected. I will keep using it.

System prompts steer. Permissions stop.

An AI agent deleted a production database this week. The thread is worth reading, but one detail stands out: Cursor's system prompt explicitly said "don't run destructive operations." The agent ran one anyway.

That is not a bug in the agent. It is a misunderstanding of what a system prompt is.

What a system prompt actually is

A system prompt is a suggestion. A nudge. Something the model considers before deciding what to do. Useful for shaping behaviour, setting context, establishing tone. Not useful for preventing a specific action from happening.

The model reads the instruction and decides how to honour it. Most of the time that works fine. But "most of the time" is not a safety guarantee, and the cases where it fails tend to be the expensive ones.

Where enforcement actually lives

The real control surface is permissions: what commands can run, with what arguments, against what resources. That is where you get enforcement, because it operates outside the model's reasoning loop entirely.

The granularity matters. Consider two commands:

git push origin main
git push --force origin main

Same base command. Completely different blast radius. One is routine, expected, recoverable. The other rewrites history and may be irreversible depending on your remote configuration. Treating them as a single "git push" permission, both allowed or both blocked, is exactly the coarse-grained thinking that causes problems.

What you want is argument-level granularity: allowlists with patterns, denylists for destructive variants, confirmation prompts for anything in a grey zone. A real gate, applied at execution time, not in instructions the model gets to interpret.

What CLAUDE.md and AGENTS.md are for

These files sit even further from enforcement than system prompts. They are Markdown the agent might read, might honour, and might weight differently depending on context. Useful for giving an agent operational context about a project. Not a safety mechanism. Treating them as one is the same mistake as trusting the system prompt to stop a destructive operation.

Devcontainers are a different thing entirely

Containers are useful, but they solve a different problem. A devcontainer limits what the agent can break if something goes wrong. It does not control what the agent decides to do. Sandboxing manages blast radius; it does not replace permission gates.

You need both. A container without argument-level permissions still lets an agent force-push inside its allowed remotes. Permissions without a container still let a misbehaving agent affect the host. They are complementary, not alternatives.

What good looks like

Some tools already get this right. OpenCode has had granular, per-agent permissions for a while. Different subagents, different permission scopes, different blast radius caps. It works, and running it that way for the better part of a year has not cost anything in productivity.

The work to get there is tedious but not complicated: map the actual command surface your agents touch, decide what is safe to allow, what needs a confirmation prompt, and what is hard-blocked, on which paths, against which remotes. Nobody enjoys doing this. It is also the only thing standing between your agent and a force push to main on a Friday afternoon, inside a container or not.

I am building a permissions extension for the Pi agent runtime along the same lines. The gate has to live at the command-and-args layer, where execution happens, not in instructions the model is expected to follow.

System prompts steer. Permissions stop.

BYOT: Bring Your Own Tokens

I keep hearing versions of the same story from consultants and in-house devs alike: "I just used Claude Code to fix that bug." Own API key, own account, company codebase. Wild west.

Does the CTO know? Compliance? Legal?

What's actually happening

When an AI agent "reads" a file or "investigates" a database issue (even with read-only access), that content gets pulled into the context window. From there it makes a round trip to an LLM API. Every turn. Source code, schemas, migration files, logs, error messages with real user data baked in.

This is not a theoretical attack surface. This is the default behaviour, happening right now, in organisations that haven't thought carefully about it. The developer isn't being malicious. They're being productive. That's exactly what makes it hard to catch.

Four questions worth asking before the next sprint

1. Which LLM APIs are actually being used in your codebase right now?

Not which ones you've approved. Which ones are actually running. Personal accounts don't show up in your procurement or security tooling. You won't find them in your firewall logs unless you're specifically looking for the right hostnames. Start by asking your developers directly. You may be surprised.

2. Do the terms of those APIs exclude training on your data?

Most major providers offer enterprise tiers with explicit no-training commitments. Personal accounts are a different story. Read the terms. Some providers are unambiguous; others are not. "Your data is not used for training" in a consumer product FAQ is not a legal guarantee. Get it in writing, in a contract, before your code is in the context window.

3. Where does the data land: EU, US, somewhere else entirely?

Data residency matters under GDPR, and increasingly under other frameworks too. A Finnish developer chasing a bug with a personal API key may be routing company data through inference infrastructure in a jurisdiction your DPA has never heard of. Saying "we didn't know" is not a GDPR defence. Saying "it was just a developer tool" is not a GDPR defence either.

4. What happens when an agent pulls PII into context while chasing a user-reported bug?

This one deserves to be said plainly. A developer gets a bug report: "user ID 48291 sees the wrong balance." They hand it to their agent. The agent reads the relevant database query, checks the logs, pulls a sample row to understand the schema. That sample row may contain a real name, a real address, a real transaction history. It is now in the context window of an API running on infrastructure you don't control, under terms you haven't reviewed, in a jurisdiction you may not know.

GDPR doesn't care that it was a quick fix.

The answer is not to ban agentic tooling

That ship has sailed, and it was a good ship. Agentic development workflows are genuinely useful. Trying to prohibit them will just push usage further underground.

The answer is to own your LLM APIs. At the very least, understand how the technology works and buy tokens from a provider whose terms, data residency, and security posture actually match your compliance requirements. Enterprise agreements with the major providers are not expensive relative to the risk. They give you audit trails, data processing agreements, and no-training commitments in writing.

Set up a company account. Give your developers access. Make it easier to do the right thing than to reach for a personal key.

Where a lot of organisations are right now

Half their codebase round-tripping through three different inference providers on personal accounts. No audit trail. No DPA. No idea.

It's unfathomable. And yet, exactly where a lot of organisations are right now.

BYOT is not a developer problem. It's a leadership problem. The people setting engineering culture and tooling policy need to get ahead of this. Before the compliance team finds out the hard way.