Tensormux is a Kubernetes-native control plane that turns a GPU fleet into a multi-tenant LLM inference service, on infrastructure you own. This is the first in a series of benchmarks we are publishing, and we are keeping every one of them transparent and reproducible. We started with the question that comes before all the others: under real production load, does the platform hold its latency SLA, stay reliable, and cost what you would expect?
On 4×H100, Tensormux serves Llama-3.1-8B-Instruct at ~2200 output tokens/sec per GPU (about 13 ms per token) and holds p95 time-to-first-token between 394 ms and 461 ms depending on routing strategy, less than half of our 1000 ms SLA in every case, with zero failed requests. At typical H100 rates that is roughly $0.32 per million output tokens.
Key results · Llama-3.1-8B-Instruct · vLLM · BF16 · 4×H100
What we do
Running inference in production is not the same as running a model on a single GPU. Once traffic grows, you are routing every request, scaling GPU-backed replicas, isolating tenants, metering usage, and holding a latency SLA, all while trying to keep the bill sane. That orchestration work is a different job from the inference engine itself.
Tensormux lives at that orchestration layer, the tier between your app gateway and the inference engine. It does not replace vLLM or SGLang; it runs on top of them and handles what an engine leaves out: LLM-aware routing, GPU-aware autoscaling, a distributed KV cache, and real multi-tenancy (per-tenant limits, metering, isolation). You bring your engine and your GPUs, and we get them serving more traffic, more reliably, to more tenants.
What we measured
We kept this run deliberately controlled and easy to reproduce: same model, same engine, fixed capacity, and only one variable changing between runs: the routing strategy. No custom runtime, no quantization tricks. Just the platform doing its job on a standard stack.
| Parameter | Value |
|---|---|
| Model | Llama-3.1-8B-Instruct · BF16 (no quantization) |
| Engine | vLLM v0.23.0 (standard open source) |
| Hardware | 4 × NVIDIA H100 80GB SXM5 · 1 GPU per pod · 4 replicas |
| Capacity | Fixed · autoscaling disabled |
| Workload | vllm bench serve · 1024 in / 256 out · steady-state · concurrency 128 |
| SLA target | p95 TTFT < 1000 ms |
What the fleet did under load
Pass criteria. A run passes when every routing strategy holds p95 TTFT under the 1000 ms SLA with zero failed requests. The four checks above are the result of this run against that bar.
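The pass criteria can be written down as a short check. The sketch below is illustrative only: the function names are ours, not part of Tensormux or vLLM, and the p95 values are the per-strategy figures reported in Table 1 of this post.

```python
SLA_MS = 1000.0  # SLA target from this post: p95 TTFT < 1000 ms

# p95 TTFT per routing strategy in milliseconds, copied from Table 1.
P95_TTFT_MS = {
    "Least Request": 394,
    "Multi-strategy": 399,
    "Least Latency": 442,
    "Throughput": 450,
    "Random": 461,
}


def sla_budget_used(p95_ms, sla_ms=SLA_MS):
    """Fraction of the SLA consumed by the observed p95 TTFT."""
    return p95_ms / sla_ms


def run_passes(p95_by_strategy, failed_requests, sla_ms=SLA_MS):
    """Pass only if every strategy is under the SLA and no request failed."""
    under_sla = all(p95 < sla_ms for p95 in p95_by_strategy.values())
    return failed_requests == 0 and under_sla


for name, p95 in P95_TTFT_MS.items():
    used = sla_budget_used(p95)
    print(f"{name:15s} budget used {used:.1%}  headroom {1 - used:.1%}")

print("run passes:", run_passes(P95_TTFT_MS, failed_requests=0))
```

Headroom is simply one minus the budget used, so a 394 ms p95 against a 1000 ms SLA leaves 60.6% of the budget unspent.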
Why routing is where a control plane earns its keep
Where a request lands is not a small detail. Modern engines hold a KV cache of the work they have already done, so when a request reaches a replica that already has its prompt prefix cached, the engine skips the expensive prefill step and the first token comes back quickly. Send that same request to a cold replica and it repeats all of that work. Multiply this across shared system prompts, RAG context, long chat histories, and many tenants on the same GPUs, and routing becomes one of the biggest levers you have over both latency and cost.
This is what Tensormux is built around. It ships prefix-cache-aware routing next to load-based strategies like least-request and least-latency, so each request goes to the replica most likely to answer it fastest.
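To make the idea concrete, here is a minimal sketch of prefix-aware replica selection with a least-request tie-break. It is a toy model under stated assumptions, not Tensormux's implementation: the `Replica` class, the token-list representation of cached prefixes, and the `pick_replica` function are all hypothetical, and real engines track cached prefixes in hashed blocks rather than raw token lists.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    in_flight: int = 0  # requests currently being served
    # Token sequences this replica is assumed to hold in its KV cache.
    cached_prefixes: list = field(default_factory=list)


def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_cached_prefix(replica, prompt_tokens):
    """Longest prefix of the prompt this replica could reuse from cache."""
    return max(
        (shared_prefix_len(p, prompt_tokens) for p in replica.cached_prefixes),
        default=0,
    )


def pick_replica(replicas, prompt_tokens):
    """Prefer the longest cached prefix; break ties by fewest in-flight requests."""
    return max(
        replicas,
        key=lambda r: (best_cached_prefix(r, prompt_tokens), -r.in_flight),
    )


system_prompt = [1, 2, 3, 4]  # stand-in token IDs for a shared system prompt
replicas = [
    Replica("a", in_flight=5, cached_prefixes=[system_prompt + [9]]),
    Replica("b", in_flight=1),
    Replica("c", in_flight=3),
]

# Shares the system prompt with replica "a", so the warm replica wins.
print(pick_replica(replicas, system_prompt + [7, 8]).name)  # a
# Nothing cached anywhere, so the choice falls back to least-request.
print(pick_replica(replicas, [42, 43]).name)  # b
```

When no replica holds a matching prefix, every candidate scores zero on the first key and the decision reduces to plain least-request routing.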
This first benchmark measures the floor, not the ceiling. The workload here is uniform with no shared context to reuse, and all four replicas are identical, so there is genuinely nothing for a smart router to exploit. That is the point. Before we show routing pulling ahead on the workloads where it should (shared prefixes and multi-tenant contention, both next in this series), we wanted to confirm the boring but essential thing first: every routing mode stays stable and holds the SLA under sustained load.
| Routing strategy | p95 TTFT | SLA budget used | Headroom | SLA |
|---|---|---|---|---|
| Least Request | 394 ms | 39.4% | 60.6% | ✓ |
| Multi-strategy | 399 ms | 39.9% | 60.1% | ✓ |
| Least Latency | 442 ms | 44.2% | 55.8% | ✓ |
| Throughput | 450 ms | 45.0% | 55.0% | ✓ |
| Random | 461 ms | 46.1% | 53.9% | ✓ |
Table 1. Per-strategy p95 TTFT and SLA headroom. Multi-strategy is a blend that combines least-request, least-latency, and throughput signals into a single routing decision, rather than a separate algorithm. Throughput (~2200 tok/s per H100), time per output token (~13 ms), and cost (~$0.32 per 1M tokens) were effectively identical across all five. Config: c=128, steady-state, around 80% GPU utilization, kept below saturation on purpose. Fixed seeds, identical request order, declared configs.
That cost is the fully-loaded number for serving on your own H100 at this latency-optimized point, and it drops further at higher utilization, the usual trade between latency and cost. It is priced on GPU time, not resold through an API. The value is that it is a known, predictable number on hardware you control.
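The arithmetic behind a GPU-time price is short enough to sketch. The hourly rate below is an assumption chosen for illustration (this post says only "typical H100 rates"), and the function name is ours; substitute your own rate and measured throughput.

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_sec_per_gpu):
    """USD per 1M output tokens for one GPU billed by the hour."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# ASSUMPTION: a hypothetical $2.50 per H100-hour, not a rate quoted in this post.
ASSUMED_H100_USD_PER_HOUR = 2.50
MEASURED_TOKENS_PER_SEC = 2200  # per-GPU output throughput reported above

cost = cost_per_million_tokens(ASSUMED_H100_USD_PER_HOUR, MEASURED_TOKENS_PER_SEC)
print(f"${cost:.2f} per 1M output tokens")
```

At the assumed rate this works out to about $0.32 per million output tokens. Because the hourly price is fixed, any increase in sustained tokens per second lowers the unit cost in direct proportion, which is the latency-versus-cost trade described above.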
Why this matters for you
Before anything clever, the unglamorous thing has to be true: the platform holds a real SLA, stays reliable, and costs what you expect, on standard hardware and a standard engine. That is what this run shows, and it is the foundation everything else builds on.
Sub-500 ms p95 TTFT at around $0.32 per 1M tokens, with zero failed requests. You get a known SLA and a known unit cost instead of a surprise bill from a black-box API.
Dedicated, on-prem, or sovereign. Your models and data stay inside your environment, which is what regulated and data-sovereignty deployments actually require.
Serve many teams or customers on the same GPUs with per-tenant isolation, rate limits, metering, and billing. That is an inference business, not just an endpoint.
We run on the vLLM or SGLang you already trust and inherit everything the open-source ecosystem ships. There is no proprietary runtime to bet the company on.
We are not trying to be the fastest kernel; that is the engine's job. What Tensormux does better than anyone is get you to production-grade, multi-tenant, sovereign inference on your own GPUs, without making you build and run the orchestration layer yourself.
What's next in this series
This run validated the foundation. The next two go straight at the part a single-tenant throughput test cannot show, where the control plane actually pulls ahead:
- Prefix-aware routing on shared-prefix workloads (system prompts, RAG, multi-turn), where the routing decision cuts TTFT by a lot rather than a little.
- Multi-tenant SLA isolation: holding every tenant's SLA even while one tenant spikes (the noisy-neighbour test), plus how many models you can pack onto a GPU while still meeting SLA.
Want this on your own GPUs?
We will help you map dedicated, on-prem, or sovereign serving to your stack, or you can start free with the open-source gateway.
Methodology and reproducibility. Llama-3.1-8B-Instruct (BF16) on vLLM v0.23.0 · 4×H100 80GB SXM5 · vllm bench serve · 1024/256 · steady-state · c=128 · SLA target p95 TTFT < 1000 ms · a run passes when the SLA is met with zero failed requests. Throughput is output tokens per second; cost is based on GPU time at the stated H100 rates. Full configs and scenario files are published alongside this post.
© Tensormux · First in an ongoing, transparent benchmark series.

