Tensormux is a Kubernetes-native control plane that turns a GPU fleet into a multi-tenant LLM inference service, on infrastructure you own. This is the first in a series of benchmarks we are publishing, and we are keeping every one of them transparent and reproducible. We started with the question that comes before all the others: under real production load, does the platform hold its latency SLA, stay reliable, and cost what you would expect?

On 4×H100, Tensormux serves Llama-3.1-8B-Instruct at ~2200 output tokens/sec per GPU (about 13 ms per output token) and holds p95 time-to-first-token between 394 and 461 ms across all five routing strategies, less than half of our 1000 ms SLA, with zero failed requests. At typical H100 rates that is roughly $0.32 per million output tokens.

~2200 tok/s · Output throughput per H100
394 to 461 ms · p95 TTFT across routing strategies (SLA: 1000 ms)
$0.32 · per 1M output tokens
~13 ms · Mean time per output token
0 · Failed requests
128 · Concurrent requests (steady-state)

Key results · Llama-3.1-8B-Instruct · vLLM · BF16 · 4×H100

The platform

What we do

Running inference in production is not the same as running a model on a single GPU. Once traffic grows, you are routing every request, scaling GPU-backed replicas, isolating tenants, metering usage, and holding a latency SLA, all while trying to keep the bill sane. That orchestration work is a different job from the inference engine itself.

Tensormux lives at that orchestration layer, the tier between your app gateway and the inference engine. It does not replace vLLM or SGLang; it runs on top of them and handles what an engine leaves out: LLM-aware routing, GPU-aware autoscaling, a distributed KV cache, and real multi-tenancy (per-tenant limits, metering, isolation). You bring your engine and your GPUs, and we get them serving more traffic, more reliably, to more tenants.

(closest to your application)
AI gateway / app routing: Portkey, LiteLLM
Inference control plane: Tensormux (routing, autoscaling, multi-tenancy, KV fabric)
Inference engine: vLLM, SGLang, TensorRT-LLM (we run on top of these)
Kernel / compiler: CUDA, Triton, FlashAttention, CUTLASS
GPU hardware / cloud: NVIDIA (H100, L40S, …), AMD, and more · on-prem, dedicated, or sovereign
(closest to the raw GPU)

Fig 1. Where Tensormux sits in the serving stack, ordered from the application at the top down to the raw GPU. The engine and kernel tiers make one replica fast; the control plane is what makes a whole fleet reliable, multi-tenant, and cost-efficient.

Method

What we measured

We kept this run deliberately controlled and easy to reproduce. The model, engine, and capacity are fixed, and only one variable changes between runs: the routing strategy. No custom runtime, no quantization tricks. Just the platform doing its job on a standard stack.

Model	Llama-3.1-8B-Instruct · BF16 (no quantization)
Engine	vLLM v0.23.0 (standard open source)
Hardware	4 × NVIDIA H100 80GB SXM5 · 1 GPU per pod · 4 replicas
Capacity	Fixed · autoscaling disabled
Workload	vllm bench serve · 1024 in / 256 out · steady-state · concurrency 128
SLA target	p95 TTFT < 1000 ms

Results

What the fleet did under load

Production validation: Pass

p95 time-to-first-token under 1000 ms

Zero failed requests

Throughput stable across all five routing strategies

Cost predictable and known up front

Pass criteria: A run passes when every routing strategy holds p95 TTFT under the 1000 ms SLA with zero failed requests. The four checks above are the result of this run against that bar.
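The pass criteria above can be written down as a small check. This is a minimal sketch, not the harness used for this run: the `results` dictionary shape and both function names are hypothetical, and the nearest-rank percentile is one of several common definitions, which may differ from the one the benchmark tool uses.

```python
SLA_MS = 1000.0  # p95 TTFT target stated in the post


def p95_nearest_rank(samples_ms):
    """Nearest-rank 95th percentile of a list of TTFT samples in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one TTFT sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, giving a 1-based rank without float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def run_passes(results, sla_ms=SLA_MS):
    """Return True only if every strategy meets the SLA with zero failures.

    `results` is a hypothetical shape:
    {strategy_name: {"ttft_ms": [float, ...], "failed": int}}
    """
    return all(
        r["failed"] == 0 and p95_nearest_rank(r["ttft_ms"]) < sla_ms
        for r in results.values()
    )
```

A single strategy with one failed request, or one whose p95 TTFT reaches 1000 ms, fails the whole run, which matches the "every routing strategy" wording of the criteria.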

Why routing is where a control plane earns its keep

Where a request lands is not a small detail. Modern engines hold a KV cache of the work they have already done, so when a request reaches a replica that already has its prompt prefix cached, the engine skips prefill for the cached tokens and the first token comes back quickly. Send that same request to a cold replica and it repeats all of that work. Multiply this across shared system prompts, RAG context, long chat histories, and many tenants on the same GPUs, and routing becomes one of the biggest levers you have over both latency and cost.

This is what Tensormux is built around. It ships prefix-cache-aware routing next to load-based strategies like least-request and least-latency, so each request goes to the replica most likely to answer it fastest.
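To make the idea concrete, here is a minimal sketch of prefix-cache-aware routing with a least-request tie-break. It is an illustration under stated assumptions, not Tensormux's implementation: the `Replica` class, the 16-token block size, and every function name are hypothetical, and a real router would learn cache contents from the engine rather than track them itself.

```python
from dataclasses import dataclass, field

BLOCK_TOKENS = 16  # assumed cache block size, chosen only for illustration


@dataclass
class Replica:
    name: str
    in_flight: int = 0
    cached_blocks: set = field(default_factory=set)


def prefix_block_keys(token_ids, block_tokens=BLOCK_TOKENS):
    """One key per full block; each key covers the whole prefix up to that block."""
    return [
        hash(tuple(token_ids[:end]))
        for end in range(block_tokens, len(token_ids) + 1, block_tokens)
    ]


def cached_prefix_blocks(replica, keys):
    """Count the leading blocks of the prompt that the replica already holds."""
    count = 0
    for key in keys:
        if key not in replica.cached_blocks:
            break
        count += 1
    return count


def pick_replica(replicas, token_ids):
    """Longest cached prefix wins; ties go to the replica with fewest in-flight requests."""
    keys = prefix_block_keys(token_ids)
    return max(replicas, key=lambda r: (cached_prefix_blocks(r, keys), -r.in_flight))


def record_dispatch(replica, token_ids):
    """Book-keeping after routing: the replica is busier and now caches this prompt."""
    replica.in_flight += 1
    replica.cached_blocks.update(prefix_block_keys(token_ids))
```

When no replica holds any part of the prompt, every candidate scores zero cached blocks and the choice reduces to plain least-request, which is the situation the uniform workload in this benchmark creates.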

This first benchmark measures the floor, not the ceiling. The workload here is uniform with no shared context to reuse, and all four replicas are identical, so there is genuinely nothing for a smart router to exploit. That is the point. Before we show routing pulling ahead on the workloads where it should (shared prefixes and multi-tenant contention, both next in this series), we wanted to confirm the boring but essential thing first: every routing mode stays stable and holds the SLA under sustained load.

Fig 2. p95 time-to-first-token against the 1000 ms SLA. All five routing strategies land in a single 67 ms band (394 to 461 ms), under half the budget, with zero failures. With identical replicas and no shared context, routing has little to separate here, exactly as expected. The story is the headroom.

Routing strategy	p95 TTFT	SLA budget used	Headroom	SLA met
Least Request	394 ms	39.4%	60.6%	✓
Multi-strategy	399 ms	39.9%	60.1%	✓
Least Latency	442 ms	44.2%	55.8%	✓
Throughput	450 ms	45.0%	55.0%	✓
Random	461 ms	46.1%	53.9%	✓

Table 1. Per-strategy p95 TTFT and SLA headroom. Multi-strategy is a blend that combines least-request, least-latency, and throughput signals into a single routing decision, rather than a separate algorithm. Throughput (~2200 tok/s per H100), time per output token (~13 ms), and cost (~$0.32 per 1M tokens) were effectively identical across all five. Config: c=128, steady-state, around 80% GPU utilization, kept below saturation on purpose. Fixed seeds, identical request order, declared configs.
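The budget and headroom columns in Table 1 follow directly from the p95 values and the 1000 ms SLA. A short sketch of that arithmetic, using only the numbers from the table (the function names are ours, for illustration):

```python
SLA_MS = 1000

# p95 TTFT per routing strategy, copied from Table 1.
P95_TTFT_MS = {
    "Least Request": 394,
    "Multi-strategy": 399,
    "Least Latency": 442,
    "Throughput": 450,
    "Random": 461,
}


def budget_used_pct(p95_ms, sla_ms=SLA_MS):
    """Share of the SLA budget consumed by the measured p95, in percent."""
    return round(100 * p95_ms / sla_ms, 1)


def headroom_pct(p95_ms, sla_ms=SLA_MS):
    """Remaining SLA budget, in percent."""
    return round(100 - budget_used_pct(p95_ms, sla_ms), 1)


for name, p95 in P95_TTFT_MS.items():
    print(f"{name:15s} {p95} ms  used {budget_used_pct(p95):.1f}%  headroom {headroom_pct(p95):.1f}%")

# Width of the band that all five strategies fall into.
spread_ms = max(P95_TTFT_MS.values()) - min(P95_TTFT_MS.values())
print(f"spread: {spread_ms} ms")
```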

H100-1: ~2200 · H100-2: ~2200 · H100-3: ~2200 · H100-4: ~2200 (output tok/s) · Aggregate output: ~8800 tok/s

Fig 3. Output tokens/sec per H100. Tensormux drives all four replicas evenly, for ~8800 tok/s aggregate across the cluster.
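As a rough consistency check, the per-GPU throughput can be related to the other published figures with closed-loop arithmetic. This is a back-of-the-envelope sketch, not part of the measurement. It assumes the 128 concurrent requests split evenly across the four replicas, uses 0.4 s as a stand-in for mean TTFT (the post reports p95 only, so this is an upper-leaning guess), and uses the rounded 13 ms time per output token.

```python
def expected_output_tok_per_s(concurrency, output_tokens, ttft_s, tpot_s):
    """Steady-state output rate for one replica under closed-loop load.

    Each request takes TTFT for the first token plus one TPOT interval for
    every later token; `concurrency` such requests are always in flight.
    """
    request_latency_s = ttft_s + (output_tokens - 1) * tpot_s
    return concurrency * output_tokens / request_latency_s


# Assumption: load is spread evenly, so each replica sees 128 / 4 requests.
per_replica_concurrency = 128 / 4

per_gpu = expected_output_tok_per_s(
    concurrency=per_replica_concurrency,
    output_tokens=256,   # output length from the workload definition
    ttft_s=0.4,          # assumed stand-in for mean TTFT
    tpot_s=0.013,        # ~13 ms per output token, rounded
)
print(f"per GPU: {per_gpu:.0f} tok/s, aggregate: {4 * per_gpu:.0f} tok/s")
```

Under those assumptions the estimate lands close to the reported ~2200 tok/s per H100 and ~8800 tok/s aggregate, which suggests the headline numbers are mutually consistent.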

Fig 4. Fully-loaded cost per 1M output tokens, read off your own H100 GPU-hour rate. Lower GPU price, lower cost per token.

That cost is the fully-loaded number for serving on your own H100 at this latency-optimized point, and it drops further at higher utilization, the usual trade between latency and cost. It is priced on GPU time, not resold through an API. The value is that it is a known, predictable number on hardware you control.
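The relationship between GPU-hour price and cost per token is simple enough to compute for your own rate. A minimal sketch: the ~2200 tok/s figure is from this run, while the hourly rates below are illustrative assumptions, not prices quoted in this post. A rate of about $2.50 per H100-hour is the one that reproduces the ~$0.32 figure.

```python
def cost_per_million_output_tokens(gpu_hour_usd, tok_per_s_per_gpu):
    """USD per 1M output tokens for one GPU at a given sustained throughput."""
    tokens_per_gpu_hour = tok_per_s_per_gpu * 3600
    return gpu_hour_usd / tokens_per_gpu_hour * 1_000_000


THROUGHPUT_TOK_PER_S = 2200  # per-H100 output throughput reported in this run

# Hypothetical H100 hourly rates; substitute your own.
for rate in (2.00, 2.50, 3.00):
    cost = cost_per_million_output_tokens(rate, THROUGHPUT_TOK_PER_S)
    print(f"${rate:.2f}/GPU-hour -> ${cost:.2f} per 1M output tokens")
```

Because the cost scales linearly with the hourly rate and inversely with throughput, running closer to saturation lowers the unit cost at the expense of latency headroom.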

Implications

Why this matters for you

Before anything clever, the unglamorous thing has to be true: the platform holds a real SLA, stays reliable, and costs what you expect, on standard hardware and a standard engine. That is what this run shows, and it is the foundation everything else builds on.

Predictable latency and cost

Sub-500 ms p95 TTFT at around $0.32 per 1M tokens, with zero failed requests. You get a known SLA and a known unit cost instead of a surprise bill from a black-box API.

On infrastructure you own

Dedicated, on-prem, or sovereign. Your models and data stay inside your environment, which is what regulated and data-sovereignty deployments actually require.

Multi-tenant by design

Serve many teams or customers on the same GPUs with per-tenant isolation, rate limits, metering, and billing. That is an inference business, not just an endpoint.

No engine lock-in

We run on the vLLM or SGLang you already trust and inherit everything the open-source ecosystem ships. There is no proprietary runtime to bet the company on.

We are not trying to be the fastest kernel; that is the engine's job. What Tensormux does better than anyone is get you to production-grade, multi-tenant, sovereign inference on your own GPUs, without making you build and run the orchestration layer yourself.

Roadmap

What's next in this series

This run validated the foundation. The next two go straight at the part a single-tenant throughput test cannot show, where the control plane actually pulls ahead:

Prefix-aware routing on shared-prefix workloads (system prompts, RAG, multi-turn), where the routing decision cuts TTFT by a lot rather than a little.
Multi-tenant SLA isolation: holding every tenant's SLA even while one tenant spikes (the noisy-neighbour test), plus how many models you can pack onto a GPU while still meeting the SLA.

Want this on your own GPUs?

We will help you map dedicated, on-prem, or sovereign serving to your stack, or you can start free with the open-source gateway.


Methodology and reproducibility. Llama-3.1-8B-Instruct (BF16) on vLLM v0.23.0 · 4×H100 80GB SXM5 · vllm bench serve · 1024/256 · steady-state · c=128 · SLA target p95 TTFT < 1000 ms · a run passes when the SLA is met with zero failed requests. Throughput is output tokens per second; cost is based on GPU time at the stated H100 rates. Full configs and scenario files are published alongside this post.