H100 vs Strix Halo: prefill and parallelism are the real gap
I spent yesterday running a 10-test benchmark sweep against an UpCloud H100 80GB SXM, serving Qwen3.6-27B-FP8 under vLLM 0.21. Then I compared it to the numbers I have been running on my Strix Halo box (Qwen3.6-35B MoE Q8_0 under llama.cpp). The models are not identical. They are in the same weight-footprint class (28-34 GB), but one is a 27B dense FP8 and the other is a 35B MoE Q8. Treat the comparison as directional, not as a head-to-head.
The context for this work: at Trail Openers we are moving toward more environmentally sustainable LLM infrastructure for our internal tooling, and the H100 setup we are building will be shared by several developers, not just by me. UpCloud was chosen for three reasons that compound. First, their data centres run majority on renewable energy and their scope 1+2 emissions are compensated. Second, the fi-hel2 data centre is in Helsinki, the same metropolitan area as Trail Openers, which means the whole team gets single-digit-millisecond latency to the endpoint. Third, EU jurisdiction and GDPR-native data handling matter for the work we do; the prompts, the diffs, the codebases, none of that leaves the EU. The benchmark sweep was the empirical part of deciding whether the architecture actually pays off for that team-sized, sustainability-conscious, locally-hosted use case.
The directional answer surprised me in two ways. Single-user generation is closer than the price tag suggests. Prefill and concurrent serving are not close at all.
Single request, side by side
The first observation that matters is how small the single-user generation gap looks at first glance.
| Metric | H100 (Qwen3.6-27B-FP8) | Strix Halo (Qwen3.6-35B MoE Q8) | Δ |
|---|---|---|---|
| Prefill rate (~512 input tokens) | 11,378 t/s | 1,388 t/s | ~8.2× |
| Token generation rate | 77 t/s | 54 t/s | ~1.4× |
This is where the non-identical-model caveat actually bites. Strix Halo is running an MoE variant, which has roughly the same total parameter count as a dense 27B but activates only a fraction of those parameters per token. MoE is the shape Strix Halo's memory architecture happens to suit: ~225 GB/s bandwidth is enough to stream the active experts and not much else. If you forced Strix Halo to run a dense 27B at the same quantisation, the generation rate would drop materially even with MTP turned on. The 1.4× number is partly Strix Halo being clever about what model you let it have, not Strix Halo being equivalent hardware.
The H100 does not care about this distinction in the same way. It has the bandwidth to run either shape competitively. So the realistic reading of the 1.4× is "Strix Halo with the model that suits it, against H100 with the model H100 is serving." If both ran a dense 27B at the same precision, the generation-rate gap would widen significantly.
The 8.2× on prefill is the gap that does not depend on model shape. Any workload that involves long context (RAG, multi-file code edits, large diff reviews) pays the prefill cost. On Strix Halo, time-to-first-token on a 7k-input prompt is over 400 ms before the model has decided anything. The H100 absorbs that prefill at roughly an order of magnitude higher rate. For agentic coding, where the agent re-loads files and re-reads context constantly, prefill latency is the dominant factor in perceived responsiveness.
The 8.2× on prefill is where the asymmetry starts. Any workload that involves long context (RAG, multi-file code edits, large diff reviews) pays the prefill cost. On Strix Halo, time-to-first-token on a 7k-input prompt is over 400 ms before the model has decided anything. The H100 absorbs that prefill at roughly an order of magnitude higher rate. For agentic coding, where the agent re-loads files and re-reads context constantly, prefill latency is the dominant factor in perceived responsiveness.
Concurrency is the architectural win
The bigger finding is what happens when you stop being one user.
llama.cpp does not batch concurrent requests. One process, one user, sequential. If you have two agents asking questions at the same time, the second one waits for the first one to finish. This is fine for a personal AI box. It is not fine for a multi-agent setup.
vLLM batches concurrent requests through the same forward pass. The H100 scales throughput nearly linearly with concurrency until you saturate the GPU. Here is the mid-context shape (4k input, 512 output, the rough size of an agentic-coding turn):
| Concurrency | Total output t/s | Per-agent t/s | TTFT median | TPOT median |
|---|---|---|---|---|
| 1 | 75 | 75.2 | 243 ms | 12.8 ms |
| 4 | 259 | 64.8 | 876 ms | 13.8 ms |
| 8 | 441 | 55.1 | 1.31 s | 15.6 ms |
| 16 | 684 | 42.7 | 1.42 s | 20.3 ms |
| 32 | 929 | 29.0 | 1.88 s | 31.1 ms |
Throughput scales 12.4× from concurrency 1 to 32. Per-token latency degrades smoothly (12.8 ms to 31 ms median). The cost shows up in TTFT, which jumps from sub-second to roughly two seconds median (and a P99 tail that hits 7.5 s at concurrency 32).
The sweet spot is 8 to 16 concurrent agents. At 16 you get 43 t/s per agent and 684 t/s aggregate. That is enough to run a serious multi-agent setup (a project manager, several developers, an adversarial reviewer, a six-prong code-review fan-out, plus some headroom) on one GPU without the per-agent experience degrading below "usable."
Compare to Strix Halo serving the same load: 16 agents asking for ~200 tokens each takes roughly 60 seconds sequentially. The H100 does it in roughly 5 seconds. That is the 12× effective-serving advantage. It is not in any single number. It is in the architecture.
The cost picture
UpCloud charges €1.79/hr for H100 business hours. The per-million-token cost at different operating points:
| Operating point | Output t/s | €/M output tokens |
|---|---|---|
| Single agent | 75 | €6.63 |
| 8 concurrent | 441 | €1.13 |
| 16 concurrent | 684 | €0.73 |
| 32 short prompts | 1,284 | €0.39 |
For a small dev team running 8-16 agents during work hours, the marginal output cost is well under €1 per million tokens. Anthropic's GPT-5.2-class pricing is now $14 per million output tokens. Even accounting for the reasoning-trace overhead (more on that below), the gap is two orders of magnitude.
This is the part of the picture that has changed in the last six months. Frontier API prices went up sharply at the same time as serving stacks (vLLM, SGLang, TensorRT-LLM) got materially better at batching. The economic case for self-hosting an agentic backend on rented GPU is now defensible in a way it was not last year.
What this means in practice
Two specific deployment shapes look obviously correct now.
One Strix Halo per developer for personal use, with the right model shape. For one human at one keyboard running an MoE model that fits the memory architecture (Qwen3.6 MoE is the obvious example), Strix Halo is a reasonable personal-AI machine. 128 GB unified memory, real bandwidth, no API bill, no data leaving the box. The caveat is model selection. If you want to run dense models in the 27B+ class at usable speeds, this is not the machine. Pick the MoE variants that suit the bandwidth profile. The Strix Halo setup guide and the gotchas post cover what it takes to actually get there.
One H100 (rented) per team for multi-agent backends. The moment your workflow runs more than one agent at a time, the architecture matters more than the silicon. vLLM's continuous batching turns one GPU into a serving fleet. pi-ensemble dispatches up to six specialist children per /work invocation; on Strix Halo those run sequentially, on an H100 they run concurrently. A team of four to six developers each running pi-ensemble-style workflows fits comfortably inside the 8-16 concurrency sweet spot on a single H100. Spin it up during business hours, spin it down overnight, and the marginal cost lands well below frontier API rates while keeping the data on infrastructure you control.
This is the shape we are using at Trail Openers. One shared H100 in UpCloud's Helsinki data centre, internal endpoints behind Caddy, business-hours uptime. The decision to host locally rather than reach for a hyperscaler GPU instance was driven by three things at once: the sustainability footprint (UpCloud's energy mix is majority renewable and their ESG reporting covers scope 1, 2, and 3 with compensation for scope 1+2), the data-jurisdiction story (everything stays in the EU, GDPR-native, no extra-territorial transfers), and the simple fact that the machine sits in the same metropolitan area as the team using it. Latency from a developer's desk in greater Helsinki to a model running in fi-hel2 is dominated by the local fibre hop, not by any cross-continent route. None of this makes inference free of footprint (it never is), but it shifts the marginal cost of an extra agent run onto a cleaner grid, in a friendlier jurisdiction, with materially better latency than the default AWS/GCP region you would otherwise reach.
The caveats are real
Three things to know before you act on this.
Reasoning models burn most of the output budget on chain-of-thought. Qwen3.6 emits roughly 80% of every response as reasoning trace before the final answer. The 512-token output budgets in the table above are mostly thinking, not result. If you want 200 tokens of actual answer, plan for 1,000-1,500 output tokens per agent. The effective useful throughput is roughly half of what the raw numbers suggest. A non-reasoning model, or chat_template_kwargs.enable_thinking: false, would shift this picture significantly.
P99 TTFT under load has a long tail. At concurrency 16 the median TTFT is 1.4 s but the P99 is 3.7 s. At concurrency 32 the P99 is 7.5 s. Continuous batching is fundamental: if interactive consistency matters more than aggregate throughput, cap at 8.
First-deploy cold start is expensive. torch.compile takes about 20 minutes on first run. Subsequent boots reuse the cache and reach /health 200 in roughly 2 minutes. Worth knowing if you spin instances up and down.
The takeaway
I came into this expecting the H100 to be dramatically faster at everything. For raw single-user generation on a model that suits Strix Halo's architecture, it is not. It is 40% faster, and even that gap closes if you cherry-pick the right MoE variant for the AMD box. That part of the comparison surprised me.
The H100's value lives in two places Strix Halo structurally cannot match. It absorbs prefill at 8× the rate, which dominates perceived latency on any non-trivial context regardless of model shape. And it serves concurrent requests through one forward pass, which turns one box into a multi-agent backend instead of a single-user terminal.
For our use at Trail Openers, that lines up cleanly. Strix Halo on the desk for personal work, running MoE models that fit the bandwidth profile. A shared H100 in UpCloud's Helsinki data centre during business hours, serving the team's multi-agent backends with the sustainability footprint, EU data residency, and same-city latency we wanted. Different tools, different shapes of problem, different answers.
Benchmark details: vLLM 0.21.0 with max_model_len=8192, max_num_seqs=256, gpu_memory_utilization=0.9. Tool: vllm bench serve with --dataset-name random --ignore-eos. Hardware: UpCloud H100 80GB SXM, single GPU, fi-hel2. Strix Halo numbers from my own BENCHMARKS.md, best of ROCm pr21344 / Vulkan RADV / ROCm 7.2.3 backends.