H100 vs Strix Halo: the gap is bigger than the first benchmark suggested

I have been benchmarking Qwen3.6 on an UpCloud H100 80GB SXM, comparing it against the numbers I have been running on my Strix Halo box. The first sweep, four days ago, used Qwen3.6-27B-FP8 (dense) on the H100 under vLLM 0.21, and my established Qwen3.6-35B MoE Q8 numbers on Strix Halo under llama.cpp. The directional answer it gave me was "single-user generation is closer than you would think; prefill and concurrency win." That answer turns out to have been half-right and half-misleading. I ran the like-for-like comparison today and want to correct the picture.

The context for this work has not changed. At Trail Openers we are moving toward more environmentally sustainable LLM infrastructure for our internal tooling, and the H100 setup we are building will be shared by several developers. UpCloud was chosen for three reasons that compound. First, their data centres run majority on renewable energy and their scope 1+2 emissions are compensated. Second, the fi-hel2 data centre is in Helsinki, the same metropolitan area as Trail Openers, which means the whole team gets single-digit-millisecond latency to the endpoint. Third, EU jurisdiction and GDPR-native data handling matter for the work we do; the prompts, the diffs, the codebases, none of that leaves the EU.

What the first sweep got wrong

The first benchmark ran a dense 27B on the H100 against an MoE on Strix Halo. That comparison was fair on weight footprint (28-34 GB class) but unfair on architectural fit. Dense and MoE behave very differently per-token, and the model choice on each side biased the result in opposite directions.

The summary numbers from that first sweep:

Metric	H100 (Qwen3.6-27B dense FP8)	Strix Halo (Qwen3.6-35B MoE Q8)	Δ
Prefill rate (~512 input tokens)	11,378 t/s	1,388 t/s	~8.2×
Single-request generation rate	77 t/s	54 t/s	~1.4×

The 1.4× looked like good news for Strix Halo. It is not. It is the artefact of two model choices, not a real architectural finding. Strix Halo was running the MoE variant that suits its ~225 GB/s bandwidth (only the active experts have to be streamed per token). The H100 was running a dense model where every parameter is touched on every forward pass. That is not a fair fight in either direction.

The fair comparison is running the same MoE variant on both, with each platform's best serving stack and best decoding tricks. So I did that.

The like-for-like sweep

Yesterday I redeployed the H100 with Qwen3.6-35B-A3B-FP8 (the same 3B-active / 35B-total MoE Strix Halo runs) plus MTP speculative decoding turned on (--speculative-config '{"method":"mtp","num_speculative_tokens":2}'). MTP runs the model's built-in multi-token-prediction draft heads in parallel with the main forward pass; each accepted speculation multiplies effective throughput. KV-cache compression (--kv-cache-dtype fp8) and the Marlin MoE backend (--moe-backend marlin) round out the configuration.

The new single-request comparison:

Metric	H100 (35B MoE + MTP)	Strix Halo (35B MoE)	Δ
Median TTFT (4k input)	123 ms	434 ms	~3.5×
Median TPOT (decode step)	3.6 ms	~18.5 ms	~5.1×
Single-request generation rate	236 t/s	54 t/s	~4.4×

The "Strix Halo holds its own on generation" framing from the first benchmark was wrong. With both platforms running the model that actually suits them, with each platform's best decoding pipeline, the H100 is roughly 4.4× faster on raw single-user generation. The earlier 1.4× number was the H100 deliberately handicapped by dense-model arithmetic. Once it gets to use MoE plus MTP, the gap is the gap.

Two specific things drive the uplift. MoE means fewer active parameters per token, so each forward pass is cheaper. MTP means each forward pass can yield two-plus tokens instead of one. Multiply those and you get the 3.5× TPOT improvement at low load (3.6 ms versus 12.8 ms on the dense run), which compounds into the 3.15× single-request output rate versus the same H100's dense numbers.

Concurrency moves further in the same direction

The first sweep showed the H100 scaling to roughly 12× the effective serving capacity of Strix Halo at the 16-agent operating point. With the MoE+MTP configuration, that gap roughly doubles.

The mid-context concurrency shape (4k input, 512 output, agentic-coding turn size):

Concurrency	Aggregate output t/s	Per-agent t/s	Median TTFT	P99 TTFT	Median TPOT
1	236	236.2	123 ms	141 ms	3.6 ms
4	653	163.4	149 ms	437 ms	5.3 ms
8	1,163	145.3	231 ms	820 ms	6.2 ms
16	1,654	103.3	283 ms	1.62 s	8.3 ms
32	2,229	69.7	647 ms	3.17 s	11.1 ms

The sweet spot moves from "8 to 16 agents" to "16 to 32 agents." At concurrency 32 you still get 70 t/s per agent (more than Strix Halo's single-user rate) and 2,229 t/s aggregate. At concurrency 16 the aggregate is 1,654 t/s with P99 TTFT under 1.7 s, comfortably interactive.

Strix Halo doing the same 16-agent workload still takes roughly 60 seconds (llama.cpp does not batch). The H100 with MoE+MTP does it in roughly 2.5 seconds. The effective serving advantage is now somewhere around 24×, not 12×.

Cost recalculated

Same UpCloud business-hours pricing (€1.79/hr) divided by the new throughput numbers:

Operating point	Output t/s	€/M output tokens
Single agent	236	€2.11
8 concurrent	1,163	€0.43
16 concurrent	1,654	€0.30
32 concurrent	2,229	€0.22

At the operating point a small dev team would actually use (16 concurrent agents during work hours), the marginal output cost is roughly €0.30 per million tokens. Anthropic's GPT-5.2-class pricing is now $14 per million output tokens, so the gap is roughly 45×. Even accounting for chain-of-thought overhead on a reasoning model (more on that below) and the fact that real traffic gets lower MTP acceptance than --ignore-eos benchmarks, the economics are not close.

What this means in practice

Two specific deployment shapes still look obviously correct, but the second one looks more obviously correct than I wrote four days ago.

One Strix Halo per developer for personal use, with the right model shape. For one human at one keyboard running an MoE model that fits the memory architecture, Strix Halo remains a reasonable personal-AI machine. 128 GB unified memory, real bandwidth, no API bill, no data leaving the box. The caveat is still model selection: dense 27B-class models are not what this machine is good at. Pick the MoE variants that suit the bandwidth profile. The Strix Halo setup guide and the gotchas post cover what it takes to actually get there. What the new H100 numbers do change is your expectations of single-user speed: at 54 t/s on Strix Halo versus 236 t/s on a properly-configured H100 endpoint, the H100 is meaningfully snappier to use for the same task. Strix Halo's win is locality and cost, not throughput.

One H100 (rented) per team for multi-agent backends. The moment your workflow runs more than one agent at a time, the architecture gap is the gap that matters, and it is now even bigger. vLLM's continuous batching plus MoE plus MTP turns one GPU into a serving fleet that absorbs ~2,200 output tokens per second at the operational ceiling. pi-ensemble dispatches up to six specialist children per /work invocation; on Strix Halo those run sequentially, on an H100 they run concurrently, and at 16-agent concurrency four-to-six developers can each run their own pi-ensemble simultaneously without anyone noticing.

This is the shape we are using at Trail Openers. One shared H100 in UpCloud's Helsinki data centre, internal endpoints behind Caddy, business-hours uptime. The decision to host locally rather than reach for a hyperscaler GPU instance was driven by three things at once: the sustainability footprint (UpCloud's energy mix is majority renewable and their ESG reporting covers scope 1, 2, and 3 with compensation for scope 1+2), the data-jurisdiction story (everything stays in the EU, GDPR-native, no extra-territorial transfers), and the simple fact that the machine sits in the same metropolitan area as the team using it. Latency from a developer's desk in greater Helsinki to a model running in fi-hel2 is dominated by the local fibre hop, not by any cross-continent route. None of this makes inference free of footprint (it never is), but it shifts the marginal cost of an extra agent run onto a cleaner grid, in a friendlier jurisdiction, with materially better latency than the default AWS/GCP region you would otherwise reach.

The caveats are real

Five things to know before you act on this.

The 35B-A3B variant trades a bit of reasoning quality for throughput. Qwen's own SWE-bench numbers put the dense 27B at ~77% and the 35B-A3B at ~73%. For complex multi-step code debugging the dense model is still the better tool. For chat, summarisation, retrieval-augmented Q&A, and the bulk of agentic-coding work, the MoE+MTP combination wins on every operational axis. Pick the model by workload; do not assume one is universally better.

MTP acceptance is workload-dependent. The 3.15× single-request uplift in the table above is on --ignore-eos random-token traffic, which is unusually easy speculation. Real chat workloads, especially code generation with strict syntax, see lower acceptance rates and therefore lower uplift. Plan for 2.0× to 2.5× sustained uplift in production rather than the 3.15× peak.

Reasoning models burn most of the output budget on chain-of-thought. Qwen3.6 emits roughly 80% of every response as reasoning trace before the final answer. The 512-token output budgets in the tables above are mostly thinking, not result. If you want 200 tokens of actual answer, plan for 1,000-1,500 output tokens per agent. A non-reasoning model, or chat_template_kwargs.enable_thinking: false, would shift this picture significantly.

P99 TTFT under load still has a tail. At concurrency 16 the median TTFT is 283 ms but the P99 is 1.62 s. At concurrency 32 the P99 is 3.17 s. This is substantially better than the dense run (which hit P99 of 7.45 s at concurrency 32), but the variance still grows with load. If interactive consistency matters more than aggregate throughput, cap at 16.

First-deploy cold start is expensive. torch.compile takes about 20 minutes on first run. Subsequent boots reuse the cache and reach /health 200 in roughly 2 minutes. Worth knowing if you spin instances up and down.

The takeaway

I came into the first benchmark expecting the H100 to be dramatically faster at everything. The first sweep, with the dense 27B, suggested it was only 1.4× faster on single-user generation, and I wrote that up. That conclusion was wrong, and the way it was wrong is instructive. Comparing different-shape models across different hardware does not isolate the hardware. It tells you what your model choice is doing.

When both platforms run the same MoE variant with each platform's best decoding stack, the H100 is roughly 4.4× faster on single-user generation, roughly 8× faster on prefill, and roughly 24× more effective at concurrent serving. The architectural gap I described four days ago is real and bigger than I said. The 1.4× number should not have been the headline.

For our use at Trail Openers, the conclusion is sharper than before. Strix Halo on the desk for personal work, running MoE models that fit the bandwidth profile. A shared H100 in UpCloud's Helsinki data centre during business hours, serving the team's multi-agent backends with the sustainability footprint, EU data residency, and same-city latency we wanted. Same shape of answer as before, with a clearer view of how big the gap actually is when you compare like for like.

Benchmark details: vLLM 0.21.0. Dense 27B run (2026-06-01): max_model_len=8192, max_num_seqs=256, gpu_memory_utilization=0.9. MoE+MTP run (2026-06-05): max_num_seqs=128, gpu_memory_utilization=0.85, kv_cache_dtype=fp8, moe_backend=marlin, reasoning_parser=qwen3, speculative_config={"method":"mtp","num_speculative_tokens":2}. Both swept with vllm bench serve --dataset-name random --ignore-eos. Hardware: UpCloud H100 80GB SXM, single GPU, fi-hel2. Strix Halo numbers from my own internal benchmark notes, best of ROCm pr21344 / Vulkan RADV / ROCm 7.2.3 backends.