The previous two posts in this series were benchmarks: first a sweep on a dense 27B, then the like-for-like rerun on the same MoE variant Strix Halo runs, with MTP speculative decoding. The benchmarks closed the question of "is it fast enough." This post is about the question that comes next: "what does it actually cost to run, in money, energy, and CO₂."

We have now had the H100 endpoint in real use at Trail Openers for about a week. Several developers are using it for actual coding work, not synthetic load. The energy and footprint numbers are nothing like the "H100 = 700W" reflex would predict, and the marginal cost across real coding traffic lands at a small fraction of what an equivalent volume of frontier-API tokens would have cost. This post walks through both, with the caveats they deserve.

What we are actually running

One H100 80GB SXM in UpCloud's fi-hel2 data centre. UpCloud's published per-hour rate during business hours, lower outside. vLLM 0.21.0 serving Qwen3.6-35B-A3B-FP8 (the MoE variant, 3B active out of 35B total) with MTP speculative decoding, FP8 KV-cache, Marlin MoE backend. Endpoint behind Caddy with HTTPS. Business-hours scheduling: the box comes up in the morning, goes down in the evening, weekends off.

The deployment is OpenTofu, idempotent, one tofu apply from cold. The economic and footprint shape depends on the scheduling. Running 24/7 would cost roughly three times what business-hours-only does, for no additional throughput when nobody is at a keyboard. Scheduled correctly, the monthly cost lands in a tight, predictable range.
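The "roughly three times" figure is a ratio of billed hours. A minimal sketch of that arithmetic, assuming a flat hourly rate and a hypothetical weekday window (the exact schedule is not stated above, so the 10 to 12 hour values are illustrative only):

```python
HOURS_PER_WEEK = 24 * 7  # 168 hours if the box never goes down


def weekly_billed_hours(hours_per_day: float, days_per_week: int = 5) -> float:
    """Hours the box is up per week under a business-hours schedule."""
    return hours_per_day * days_per_week


def always_on_multiplier(hours_per_day: float, days_per_week: int = 5) -> float:
    """How many times more a 24/7 box costs than the scheduled one.

    Assumes the same hourly rate applies to every billed hour.
    """
    return HOURS_PER_WEEK / weekly_billed_hours(hours_per_day, days_per_week)


if __name__ == "__main__":
    # Hypothetical window lengths; the real schedule may differ.
    for hours in (10, 11, 12):
        print(f"{hours} h/day, weekdays only: 24/7 costs "
              f"{always_on_multiplier(hours):.2f}x as much")
```

Any weekday window between 10 and 12 hours puts the multiplier between about 2.8 and 3.4, which is where the "roughly three times" comes from.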

Energy: well below the TDP

The reflex when you hear "H100" is "700W card." That number is the datasheet TDP, which assumes a particular workload (dense compute, BF16, GPU saturated). What we are running does not look like that workload.

Measured draw from nvidia-smi integrated over time, across a week of real use:

State | Power draw | Notes
Idle (model loaded, no traffic) | ~124 W | Mostly memory refresh and the chip ticking over
Normal working load (light-to-moderate agentic traffic) | ~192-229 W | What we see during typical coding hours
Sustained 5-stream load | ~330 W | The highest sustained draw we have seen in actual use
Datasheet TDP | 700 W | Never approached in this workload
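Average draw over an interval falls out of two readings of a cumulative energy counter. The sketch below shows that calculation under stated assumptions: the counter is taken to be in millijoules (as NVML's total-energy counter reports it), and the `EnergySample` type and sample values are hypothetical, not taken from our telemetry database.

```python
from dataclasses import dataclass

JOULES_PER_KWH = 3.6e6


@dataclass(frozen=True)
class EnergySample:
    """One reading of the GPU's cumulative energy counter (hypothetical schema)."""
    unix_time_s: float  # when the counter was read
    energy_mj: int      # cumulative energy, assumed to be in millijoules


def energy_joules(earlier: EnergySample, later: EnergySample) -> float:
    """Energy consumed between two samples, in joules."""
    delta_mj = later.energy_mj - earlier.energy_mj
    if delta_mj < 0:
        # A cumulative counter should never decrease; a reset (for example
        # after a driver reload) has to be handled by the caller.
        raise ValueError("energy counter went backwards between samples")
    return delta_mj / 1000.0


def average_power_watts(earlier: EnergySample, later: EnergySample) -> float:
    """Mean power draw between two samples, in watts."""
    elapsed_s = later.unix_time_s - earlier.unix_time_s
    if elapsed_s <= 0:
        raise ValueError("samples must be in increasing time order")
    return energy_joules(earlier, later) / elapsed_s


def energy_kwh(earlier: EnergySample, later: EnergySample) -> float:
    """Energy consumed between two samples, in kilowatt-hours."""
    return energy_joules(earlier, later) / JOULES_PER_KWH


if __name__ == "__main__":
    # Illustrative values: two samples five minutes apart.
    a = EnergySample(unix_time_s=0.0, energy_mj=0)
    b = EnergySample(unix_time_s=300.0, energy_mj=37_200_000)
    print(f"{average_power_watts(a, b):.0f} W over the interval")
```

Working from counter deltas rather than instantaneous power readings means short spikes between samples are still counted.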

Three reasons the draw stays low. First, the MoE shape: only ~3B of the 35B parameters activate per token, so the compute per token is a fraction of what a dense 35B would burn. Second, FP8 is roughly 2× more energy-efficient than BF16 for the same arithmetic. Third, vLLM's prefix caching eliminates re-computation across conversational turns, which removes a category of work that would otherwise consume tokens and energy for no marginal benefit.

A live calibration confirmed the meter is unbiased (no methodology bug; the low number is real for this workload). The H100 is not a 700W card in the way most people imagine. It is a 700W card that draws less than half its TDP for this kind of inference, which is the same as saying it is a ~330W card when it matters.

CO₂: single-digit grams per hour

Helsinki sits on one of the cleanest electricity grids in Europe. Finland's lifecycle factor in May 2026 was 54 gCO₂/kWh per Electricity Maps. Apply that to the measured energy draw with a PUE of 1.2:

Hour shape | kWh per hour | gCO₂ per hour
Idle billed hour | 0.05-0.08 | ~2.7-4.3
Normal load | 0.09-0.16 | ~5-9
Heaviest sustained load seen | ~0.32 | ~17

The actual week of data confirms the range. Looking at our busiest billed hour (2026-06-15 09:00 UTC, 109M input tokens through the endpoint): 0.317 kWh, 17.1 gCO₂. Most working hours land in the 5-9 gCO₂ range.
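The energy-to-CO₂ translation is a single multiplication. The sketch below assumes the kWh figures quoted here are already facility-level (PUE applied), which is my reading because 0.317 kWh × 54 gCO₂/kWh reproduces the 17.1 g figure directly. The function names are illustrative, not the names used in our reporting pipeline.

```python
GRID_G_CO2_PER_KWH = 54.0  # constant Finland lifecycle factor used in this post
PUE = 1.2                  # constant power usage effectiveness used in this post


def facility_kwh(gpu_kwh: float, pue: float = PUE) -> float:
    """Scale GPU-level energy up to facility-level energy."""
    return gpu_kwh * pue


def grams_co2(facility_energy_kwh: float,
              grid_g_per_kwh: float = GRID_G_CO2_PER_KWH) -> float:
    """Estimated emissions for a given facility-level energy figure."""
    return facility_energy_kwh * grid_g_per_kwh


if __name__ == "__main__":
    # The busiest billed hour quoted above.
    print(f"{grams_co2(0.317):.1f} gCO2")
    # Starting from a GPU-level reading instead (hypothetical 0.2 kWh):
    print(f"{grams_co2(facility_kwh(0.2)):.1f} gCO2")
```

Keeping the PUE step separate from the grid-factor step makes it easy to swap either constant for a time-varying value later.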

Project that to a month of business-hours operation: roughly 2.7-4.6 kg of CO₂. This is comparable to running a household refrigerator for a few weeks, not to anything that should give anyone climate anxiety. The reason is not that AI inference is magically clean. It is that the specific combination of MoE + FP8 + Helsinki grid + business-hours scheduling sits at the favourable end of every variable that determines the footprint.

The UpCloud fi-hel2 facility additionally runs on 100% renewable energy and feeds waste heat into the district heating network (the operator I have been able to identify serves up to ~28,000 homes from this and adjacent facilities). The marginal kilowatt of compute, on top of being clean at the input side, displaces heating fuel at the output side. None of which makes inference free of footprint. It just shifts where the offset comes from.

An important caveat. The CO₂ numbers are estimated, not live-measured. We are using a constant grid factor (54 gCO₂/kWh) and a constant PUE (1.2). Both vary in reality. The energy figures from nvidia-smi are exact (the GPU's total_energy_consumption counter, sampled to a database every five minutes). The carbon translation on top is reasonable but not certified.

Cost: a fixed ceiling instead of a meter

The interesting property of the cost picture is not the absolute number. It is the shape. A rented dedicated GPU costs what it costs whether the team writes one diff or a hundred. There is no surprise bill, no per-token meter spinning faster as the workload scales. For a team that does not yet know how heavily it will use its agents in any given week, that is a structurally different financial risk profile than paying per token to a frontier API.

The marginal cost across real coding traffic comes out well below current frontier-API rates, and well below the published rates for hosted open-weight inference of the same model. The exact ratios depend on the comparison and the load shape, both covered in the next section. The takeaway for budgeting is simpler: instead of an unbounded line item that scales with usage, you get a predictable monthly figure that lands in roughly the same range regardless of how heavily the box gets driven within the working day.

Versus the alternatives

This is where the picture sharpens. Two comparisons that matter:

Versus Anthropic Sonnet 4.6. At our current load shape, our marginal cost is roughly 15-18× cheaper per million tokens than Sonnet's published rates. But the headline ratio understates the difference for the actual shape of agentic coding traffic, which is dramatically input-heavy. The ratio of input to output tokens in our real usage is around 120:1. The agent reads a lot of code and writes a small diff. On real two-hour samples of our actual workload, the same traffic priced on Sonnet would have cost roughly 23-74× more than running it on our own H100, depending on whether the hour was light or heavy. Frontier APIs bleed on input tokens, and agentic coding is the workload where that bleed hurts most.
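To see why an input-heavy mix matters on a per-token API, it helps to compute the blended price and the share of spend that goes to input. A minimal sketch: the 3.0 and 15.0 per-million-token prices below are illustrative placeholders, not figures quoted from any rate card, and only the 120:1 ratio comes from our usage.

```python
def blended_price_per_mtok(input_price: float,
                           output_price: float,
                           input_to_output_ratio: float) -> float:
    """Average price per million tokens for a given input:output token ratio."""
    r = input_to_output_ratio
    return (r * input_price + output_price) / (r + 1)


def input_share_of_spend(input_price: float,
                         output_price: float,
                         input_to_output_ratio: float) -> float:
    """Fraction of the total bill attributable to input tokens."""
    input_cost = input_to_output_ratio * input_price
    return input_cost / (input_cost + output_price)


if __name__ == "__main__":
    # Placeholder prices per million tokens; substitute the real rate card.
    in_price, out_price = 3.0, 15.0
    ratio = 120  # input:output ratio observed in our traffic
    print(f"blended: {blended_price_per_mtok(in_price, out_price, ratio):.2f} per Mtok")
    print(f"input share of spend: {input_share_of_spend(in_price, out_price, ratio):.0%}")
```

With these placeholder prices, input tokens account for about 96% of the bill even though each output token costs five times as much as an input token.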

Versus a hosted open-weight API serving the same Qwen3.6-35B-A3B model. Hosted open-weight inference of this model is priced an order of magnitude below Sonnet, so the gap narrows. In the near-idle state we are roughly at parity. In busy hours, where our utilisation rises and our marginal output cost drops, we are roughly 3.5× cheaper than the hosted alternative. The price advantage of the self-hosted option grows with utilisation. Below a certain steady-state load the hosted API is the right answer; above it, the rented dedicated GPU wins.
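The crossover between the two options can be written down directly: a fixed hourly cost divided by a per-token price gives the hourly token volume at which they cost the same. The sketch below uses hypothetical numbers (an hourly rate of 3.0 and a hosted blended price of 0.5 per million tokens); neither is our actual figure.

```python
def self_hosted_cost_per_mtok(hourly_rate: float, tokens_per_hour: float) -> float:
    """Effective price per million tokens on a fixed-cost box at a given load."""
    if tokens_per_hour <= 0:
        raise ValueError("tokens_per_hour must be positive")
    return hourly_rate / (tokens_per_hour / 1e6)


def break_even_tokens_per_hour(hourly_rate: float,
                               hosted_price_per_mtok: float) -> float:
    """Hourly token volume above which the fixed-cost box is cheaper."""
    if hosted_price_per_mtok <= 0:
        raise ValueError("hosted_price_per_mtok must be positive")
    return hourly_rate / hosted_price_per_mtok * 1e6


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    rate, hosted = 3.0, 0.5
    threshold = break_even_tokens_per_hour(rate, hosted)
    print(f"break-even at {threshold / 1e6:.1f}M tokens per hour")
    for load in (2e6, 6e6, 20e6):
        print(f"{load / 1e6:>5.1f}M tok/h -> "
              f"{self_hosted_cost_per_mtok(rate, load):.2f} per Mtok self-hosted")
```

Because the self-hosted cost per token falls as the load rises while the hosted price stays flat, the advantage widens above the threshold and reverses below it.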

This is the part of the picture that surprises people: open-weight models on hosted APIs have already collapsed most of the price gap to running them yourself. The dominant remaining argument for self-hosting is not "it is much cheaper." It is the structural properties: data sovereignty, fixed cost ceiling, predictable monthly accounting, and the ability to integrate the inference endpoint into the same network and trust boundary as the rest of the infrastructure.

The honest caveats

Five things to know before you read these numbers as a guarantee.

This is early operational data. A week of real but light-to-moderate use with some test traffic mixed in. Not a sustained steady-state under heavy 16-agent multi-team load. The benchmarks suggest the operating economics get better at higher utilisation (marginal cost per output token drops), but I cannot show you a month of that yet.

The CO₂ numbers are estimated, not live-measured. Constant 54 gCO₂/kWh Finland factor, constant 1.2 PUE. Both vary in reality; both are reasonable approximations.

MTP acceptance in production is lower than benchmark. The 3.15× single-request uplift in the benchmarks was on --ignore-eos random-token traffic. Real chat workloads see 2.0-2.5× sustained. Already factored into the operational numbers above, just worth saying out loud.

Business-hours scheduling has real ergonomic costs. You cannot run a long-running agent task overnight if the box is down. We have specific workflows that need this (memory consolidation, batch reviews) and we either schedule them to fit the window or accept a 24/7 cost premium for the specific hours we need.

The Trail Openers context is specific. EU jurisdiction, GDPR concerns, the team's physical location matching the data centre, the company's sustainability stance: these are real reasons for us that may or may not be reasons for you. The economic argument generalises better than the locality argument.

What this changes

For Trail Openers, this confirms the architecture decision. The shared H100 in Helsinki is meaningfully cheaper than the alternatives we were comparing against. The monthly cost ceiling is predictable. The footprint is small and on a grid that is cleaner than nearly any hyperscaler default region. And because everything stays in fi-hel2 and on internal endpoints, the data-sovereignty story is clean.

For anyone evaluating a similar setup: the headline economics are real but the durable arguments are structural. A predictable monthly bill instead of an open per-token meter. EU data residency by construction, not by configuration. Clean grid at the input, heat recovery at the output, single-digit kg of CO₂ per month at our scale. The interesting question is not whether self-hosting is cheap. It is whether the structural properties are worth the operational work, and at what team size the answer flips.

For a team of four-to-six developers doing agentic coding, our experience so far is that the answer flipped some time ago.


Telemetry source: nvidia-smi total_energy_consumption (exact GPU counter, driver 595.58.03), sampled to a database every five minutes, then aggregated into hourly usage reports. Cost figures from UpCloud's published per-hour rates. Energy-to-CO₂ translation: constant 54 gCO₂/kWh (Finland lifecycle, Electricity Maps May 2026) × constant 1.2 PUE.