Tensormux is now part of the NVIDIA Inception Program. v1.0 of the open-source gateway is live.
Gateway docs · open source

Get started with Tensormux Gateway

Self-host the open-source gateway in minutes. One config file, one endpoint.

Looking for the managed control plane (shared, dedicated, or on-prem)? See serving models →
Install

Get Tensormux running

Clone the repository, then either run it with Docker Compose or install it from source.

Docker Compose
git clone https://github.com/KrxGu/Tensormux.git
cd Tensormux
docker compose up --build
From source (Python)
git clone https://github.com/KrxGu/Tensormux.git
cd Tensormux
pip install -e .
Quickstart

Three steps to route inference

Create a config file, start the gateway, and point your OpenAI SDK at it.

1. Create config.yaml

config.yaml
gateway:
  host: 0.0.0.0
  port: 8080
  strategy: least_inflight

backends:
  - name: vllm-fast
    url: http://vllm-fast:8000
    engine: vllm
    model: llama-3.1-8b
    weight: 80
    health_endpoint: /v1/models
    tags: ["fast", "gpu-a10"]
  - name: sglang-cheap
    url: http://sglang-cheap:8000
    engine: sglang
    model: llama-3.1-8b
    weight: 20
    health_endpoint: /v1/models
    tags: ["cheap", "gpu-t4"]

health:
  interval_s: 5
  timeout_s: 2
  fail_threshold: 2
  success_threshold: 1

logging:
  level: info
  jsonl_path: tensormux.jsonl
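A typed view of this config can help catch mistakes before the gateway starts. The sketch below mirrors only the gateway and backends sections of the YAML; the validateConfig helper and the two checks it performs (known strategy, unique backend names) are illustrative assumptions, not Tensormux's own validation.

```typescript
// Types mirroring part of the quickstart config.yaml (field names taken from the YAML).
interface Backend {
  name: string;
  url: string;
  engine: string;
  model: string;
  weight: number;
  health_endpoint?: string;
  tags?: string[];
}

interface TensormuxConfig {
  gateway: { host: string; port: number; strategy: string };
  backends: Backend[];
}

// Strategy names as listed in the configuration overview.
const STRATEGIES = ["least_inflight", "ewma_latency", "weighted_round_robin"];

// Hypothetical helper: returns human-readable problems, or an empty list if none are found.
function validateConfig(cfg: TensormuxConfig): string[] {
  const problems: string[] = [];
  if (!STRATEGIES.includes(cfg.gateway.strategy)) {
    problems.push(`unknown strategy: ${cfg.gateway.strategy}`);
  }
  const seen = new Set<string>();
  for (const b of cfg.backends) {
    if (seen.has(b.name)) problems.push(`duplicate backend name: ${b.name}`);
    seen.add(b.name);
  }
  return problems;
}

const example: TensormuxConfig = {
  gateway: { host: "0.0.0.0", port: 8080, strategy: "least_inflight" },
  backends: [
    { name: "vllm-fast", url: "http://vllm-fast:8000", engine: "vllm", model: "llama-3.1-8b", weight: 80 },
    { name: "sglang-cheap", url: "http://sglang-cheap:8000", engine: "sglang", model: "llama-3.1-8b", weight: 20 },
  ],
};

console.log(validateConfig(example));
```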
2. Start the gateway

Docker Compose
services:
  tensormux:
    build: .
    ports:
      - "8080:8080"
    environment:
      - TENSORMUX_CONFIG=/app/config.yaml
    volumes:
      - ./config.yaml:/app/config.yaml:ro
3. Point your OpenAI SDK

TypeScript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY ?? "not-used-for-oss-backends",
  baseURL: "http://YOUR_TENSORMUX_HOST:8080/v1",
});
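Once the base URL points at the gateway, a request should look the same as one sent to any OpenAI-compatible endpoint. The sketch below uses plain fetch instead of the SDK so it has no dependencies. It assumes the gateway serves chat completions at /chat/completions under the /v1 base URL, which this page implies but does not state; buildChatRequest and chat are hypothetical helper names.

```typescript
interface ChatRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds the HTTP request for one user message.
// Assumption: the gateway exposes an OpenAI-style /chat/completions route under baseURL.
function buildChatRequest(baseURL: string, model: string, prompt: string): ChatRequest {
  return {
    url: `${baseURL.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
    },
  };
}

// Sends the request and returns the parsed JSON response.
async function chat(baseURL: string, model: string, prompt: string): Promise<unknown> {
  const { url, init } = buildChatRequest(baseURL, model, prompt);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`gateway returned HTTP ${res.status}`);
  return res.json();
}
```

For example, chat("http://YOUR_TENSORMUX_HOST:8080/v1", "llama-3.1-8b", "Hello") would target the model named in the quickstart config.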
Reference

Configuration overview

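The main fields supported by tensormux.yaml. The quickstart config also sets gateway.host, gateway.port, backends[].model, and health.timeout_s, which are not described here.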

gateway.strategy
least_inflight | ewma_latency | weighted_round_robin

Routing strategy for distributing requests across backends.
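This page does not define the strategies beyond their names. Going by the names alone, least_inflight and ewma_latency might behave roughly as sketched here. The BackendStats shape, the helper names, and the smoothing factor alpha = 0.2 are all assumptions for illustration, not Tensormux's implementation.

```typescript
interface BackendStats {
  name: string;
  inflight: number;      // requests currently being served by this backend
  ewmaLatencyMs: number; // exponentially weighted moving average of latency
}

// least_inflight (as the name suggests): pick the backend with the fewest in-flight requests.
function pickLeastInflight(backends: BackendStats[]): BackendStats {
  if (backends.length === 0) throw new Error("no backends");
  return backends.reduce((best, b) => (b.inflight < best.inflight ? b : best));
}

// ewma_latency (as the name suggests): pick the backend with the lowest smoothed latency.
function pickLowestEwma(backends: BackendStats[]): BackendStats {
  if (backends.length === 0) throw new Error("no backends");
  return backends.reduce((best, b) => (b.ewmaLatencyMs < best.ewmaLatencyMs ? b : best));
}

// Standard EWMA update: alpha controls how strongly the newest sample counts.
function updateEwma(previousMs: number, sampleMs: number, alpha: number = 0.2): number {
  return alpha * sampleMs + (1 - alpha) * previousMs;
}
```

The two strategies can disagree: a backend with few in-flight requests may still be the slower one, which is the case where a latency-based choice differs from a load-based one.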

backends[].name

Unique name for the backend. Used in logs and metrics.

backends[].url

Base URL of the inference backend (e.g., http://vllm:8000).

backends[].engine
vllm | sglang | tensorrt-llm

Inference engine type. Used for tagging only.

backends[].weight

Weight for weighted round-robin routing. Higher values get more traffic.
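To see how weights translate into traffic share, here is a sketch of smooth weighted round-robin, a common variant of the technique. Whether Tensormux uses this exact variant is not stated on this page; makeWeightedPicker is a hypothetical name.

```typescript
interface Weighted {
  name: string;
  weight: number;
}

// Smooth weighted round-robin: each pick adds every backend's weight to its
// running score, chooses the highest score, then subtracts the total weight
// from the chosen backend so picks are spread out instead of bunched together.
function makeWeightedPicker(backends: Weighted[]): () => string {
  if (backends.length === 0) throw new Error("no backends");
  const total = backends.reduce((sum, b) => sum + b.weight, 0);
  const current = backends.map(() => 0);
  return () => {
    let best = 0;
    for (let i = 0; i < backends.length; i++) {
      current[i] += backends[i].weight;
      if (current[i] > current[best]) best = i;
    }
    current[best] -= total;
    return backends[best].name;
  };
}

// Weights from the quickstart config: 80 and 20.
const pick = makeWeightedPicker([
  { name: "vllm-fast", weight: 80 },
  { name: "sglang-cheap", weight: 20 },
]);
const firstTen = Array.from({ length: 10 }, () => pick());
console.log(firstTen.join(" "));
```

With weights 80 and 20, every run of five picks under this scheme contains four for vllm-fast and one for sglang-cheap.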

backends[].health_endpoint

HTTP path used for health checks. Defaults to /v1/models.
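A probe against this path can be sketched as follows. The default of /v1/models comes from this page; treating any 2xx response to a GET as healthy, the two-second default timeout (borrowed from health.timeout_s in the quickstart), and the helper names are assumptions.

```typescript
const DEFAULT_HEALTH_ENDPOINT = "/v1/models";

// Joins the backend base URL and the health path without doubling slashes.
function healthUrl(backendUrl: string, healthEndpoint: string = DEFAULT_HEALTH_ENDPOINT): string {
  return backendUrl.replace(/\/+$/, "") + "/" + healthEndpoint.replace(/^\/+/, "");
}

// Returns true if the backend answers with a 2xx status before the timeout.
// Assumption: a plain GET with a 2xx response counts as healthy.
async function probe(backendUrl: string, healthEndpoint?: string, timeoutS: number = 2): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutS * 1000);
  try {
    const res = await fetch(healthUrl(backendUrl, healthEndpoint), { signal: controller.signal });
    return res.ok;
  } catch {
    // Network errors and timeouts both count as a failed probe.
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```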

backends[].tags

List of string tags for labeling and filtering (e.g., region, GPU tier).
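This page does not say how the gateway itself filters on tags. As an illustration of the idea, a tool that reads the config could select backends carrying every required tag like this; withAllTags is a hypothetical helper.

```typescript
interface TaggedBackend {
  name: string;
  tags: string[];
}

// Keeps only the backends that carry every tag in `required`.
function withAllTags(backends: TaggedBackend[], required: string[]): TaggedBackend[] {
  return backends.filter((b) => required.every((t) => b.tags.includes(t)));
}

// Backends and tags from the quickstart config.
const tagged: TaggedBackend[] = [
  { name: "vllm-fast", tags: ["fast", "gpu-a10"] },
  { name: "sglang-cheap", tags: ["cheap", "gpu-t4"] },
];

console.log(withAllTags(tagged, ["fast"]).map((b) => b.name));
```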

health.interval_s

Seconds between health check probes per backend.

health.fail_threshold

Consecutive failures before marking a backend unhealthy.

health.success_threshold

Consecutive successes before restoring a backend to healthy.
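The two thresholds describe a small state machine per backend. The sketch below is one way to model it, assuming a backend starts out healthy and that any success resets the failure streak (and vice versa); HealthTracker is a hypothetical class, not part of Tensormux.

```typescript
class HealthTracker {
  private healthy: boolean = true; // assumption: backends start healthy
  private failures: number = 0;    // current streak of failed probes
  private successes: number = 0;   // current streak of successful probes
  private failThreshold: number;
  private successThreshold: number;

  constructor(failThreshold: number, successThreshold: number) {
    this.failThreshold = failThreshold;
    this.successThreshold = successThreshold;
  }

  // Records one probe result and returns the backend's health afterwards.
  record(ok: boolean): boolean {
    if (ok) {
      this.failures = 0;
      this.successes += 1;
      if (!this.healthy && this.successes >= this.successThreshold) this.healthy = true;
    } else {
      this.successes = 0;
      this.failures += 1;
      if (this.healthy && this.failures >= this.failThreshold) this.healthy = false;
    }
    return this.healthy;
  }
}

// Quickstart values: fail_threshold 2, success_threshold 1.
const tracker = new HealthTracker(2, 1);
const timeline = [false, false, true].map((ok) => tracker.record(ok));
console.log(timeline);
```

With the quickstart values, a single failed probe does not remove a backend from rotation, the second consecutive failure does, and one successful probe restores it.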

logging.level
debug | info | warning | error

Log verbosity level.

logging.jsonl_path

File path for JSONL audit logs. Records every routed request with backend, latency, and status.
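Because the log holds one JSON object per line, it is easy to summarize with a short script. The field names used here (backend, latency_ms, status) are guesses based on the description above, and reading status as an HTTP status code is also an assumption; check an actual log line before relying on them.

```typescript
// Hypothetical record shape; the real field names may differ.
interface LogRecord {
  backend: string;
  latency_ms: number;
  status: number;
}

interface BackendSummary {
  count: number;
  meanLatencyMs: number;
  errors: number; // records with status >= 400 (assumes HTTP status codes)
}

// Parses JSONL text and aggregates request count, mean latency, and errors per backend.
function summarizeByBackend(jsonl: string): Record<string, BackendSummary> {
  const sums: Record<string, { count: number; latencySum: number; errors: number }> = {};
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines, e.g. a trailing newline
    const rec = JSON.parse(line) as LogRecord;
    const s = sums[rec.backend] ?? { count: 0, latencySum: 0, errors: 0 };
    s.count += 1;
    s.latencySum += rec.latency_ms;
    if (rec.status >= 400) s.errors += 1;
    sums[rec.backend] = s;
  }
  const out: Record<string, BackendSummary> = {};
  for (const name of Object.keys(sums)) {
    const s = sums[name];
    out[name] = { count: s.count, meanLatencyMs: s.latencySum / s.count, errors: s.errors };
  }
  return out;
}
```

In Node.js, the input could come from reading the file at logging.jsonl_path as UTF-8 text.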

Full reference documentation

Source code, contributing guide, and full API docs are in the GitHub repository.

View on GitHub →