Tensormux · The control plane for LLM inference

Install

Get Tensormux running

Clone the repository, then either run it with Docker Compose or install it from source.

Docker Compose

git clone https://github.com/KrxGu/Tensormux.git
cd Tensormux
docker compose up --build

From source (Python)

git clone https://github.com/KrxGu/Tensormux.git
cd Tensormux
pip install -e .

Quickstart

Three steps to route inference

Create a config file, start the gateway, and point your OpenAI SDK at it.

Create config.yaml

config.yaml

gateway:
  host: 0.0.0.0
  port: 8080
  strategy: least_inflight
 
backends:
  - name: vllm-fast
    url: http://vllm-fast:8000
    engine: vllm
    model: llama-3.1-8b
    weight: 80
    health_endpoint: /v1/models
    tags: ["fast", "gpu-a10"]
 
  - name: sglang-cheap
    url: http://sglang-cheap:8000
    engine: sglang
    model: llama-3.1-8b
    weight: 20
    health_endpoint: /v1/models
    tags: ["cheap", "gpu-t4"]
 
health:
  interval_s: 5
  timeout_s: 2
  fail_threshold: 2
  success_threshold: 1
 
logging:
  level: info
  jsonl_path: tensormux.jsonl
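
Before starting the gateway, it can help to sanity-check the parsed config. The sketch below is only an illustration, not part of Tensormux: the type names and the checkConfig helper are hypothetical, and it assumes the YAML has already been parsed into a plain object. It checks two things taken from the reference on this page, namely that the strategy is one of the three listed values and that backend names are unique.

```typescript
// Hypothetical types mirroring the config.yaml fields shown above.
interface BackendConfig {
  name: string;
  url: string;
  engine: string;
  model: string;
  weight: number;
  health_endpoint: string;
  tags: string[];
}

interface GatewayConfig {
  gateway: { host: string; port: number; strategy: string };
  backends: BackendConfig[];
}

// The three strategy values listed in the configuration overview.
const STRATEGIES: string[] = ["least_inflight", "ewma_latency", "weighted_round_robin"];

// Returns a list of human-readable problems; an empty list means both checks passed.
function checkConfig(cfg: GatewayConfig): string[] {
  const problems: string[] = [];
  if (STRATEGIES.indexOf(cfg.gateway.strategy) === -1) {
    problems.push("unknown strategy: " + cfg.gateway.strategy);
  }
  const seen: { [name: string]: boolean } = {};
  for (const b of cfg.backends) {
    if (seen[b.name]) problems.push("duplicate backend name: " + b.name);
    seen[b.name] = true;
  }
  return problems;
}
```

Calling checkConfig on the quickstart config above should return an empty list, since the strategy is least_inflight and the two backend names differ.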

Start the gateway

Docker Compose

services:
  tensormux:
    build: .
    ports:
      - "8080:8080"
    environment:
      - TENSORMUX_CONFIG=/app/config.yaml
    volumes:
      - ./config.yaml:/app/config.yaml:ro

Point your OpenAI SDK

TypeScript

import OpenAI from "openai";
 
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY ?? "not-used-for-oss-backends",
  baseURL: "http://YOUR_TENSORMUX_HOST:8080/v1",
});
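
The client above is only configured; nothing is sent yet. As a rough, dependency-free sketch of what a request through the gateway could look like, the helper below builds an OpenAI-style chat completion request. It assumes the gateway forwards POST /v1/chat/completions to the backends (implied by the OpenAI SDK setup, but not stated on this page). The buildChatRequest name and ChatRequest type are hypothetical, and the model name is the one from config.yaml.

```typescript
// Pieces needed for an HTTP POST, defined locally so the sketch has no dependencies.
interface ChatRequest {
  url: string;
  method: "POST";
  headers: { [name: string]: string };
  body: string;
}

// Builds a chat completion request against the gateway's base URL.
// Trailing slashes on baseURL are dropped before the path is appended.
function buildChatRequest(baseURL: string, model: string, prompt: string): ChatRequest {
  return {
    url: baseURL.replace(/\/+$/, "") + "/chat/completions",
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: model, messages: [{ role: "user", content: prompt }] }),
  };
}

const req = buildChatRequest("http://YOUR_TENSORMUX_HOST:8080/v1", "llama-3.1-8b", "Hello");
// To send it with Node 18+ or a browser:
//   fetch(req.url, { method: req.method, headers: req.headers, body: req.body })
```

With the OpenAI SDK client configured above, the equivalent call would be client.chat.completions.create with the same model and messages.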

Reference

Configuration overview

Key fields of the Tensormux configuration file (config.yaml in the quickstart above).

gateway.strategy

least_inflight | ewma_latency | weighted_round_robin

Routing strategy for distributing requests across backends.
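
To make the first two strategy names concrete, here is a minimal sketch of how such strategies are commonly implemented. It is not taken from the Tensormux source: the BackendState type, both function names, and the alpha smoothing factor are hypothetical, and it assumes unhealthy backends are skipped during selection.

```typescript
// Hypothetical per-backend runtime state tracked by a gateway.
interface BackendState {
  name: string;
  healthy: boolean;
  inflight: number;      // requests currently being served
  ewmaLatencyMs: number; // smoothed latency estimate
}

// least_inflight: choose the healthy backend with the fewest in-flight requests.
// Returns undefined when no backend is healthy.
function pickLeastInflight(backends: BackendState[]): BackendState | undefined {
  let best: BackendState | undefined = undefined;
  for (const b of backends) {
    if (!b.healthy) continue;
    if (best === undefined || b.inflight < best.inflight) best = b;
  }
  return best;
}

// ewma_latency: fold a new latency sample into the running average.
// alpha (between 0 and 1) is an assumed smoothing factor; higher values react faster.
function updateEwma(previousMs: number, sampleMs: number, alpha: number): number {
  return alpha * sampleMs + (1 - alpha) * previousMs;
}
```

An ewma_latency router would then pick the healthy backend with the lowest ewmaLatencyMs, in the same way pickLeastInflight compares inflight counts.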

backends[].name

Unique name for the backend. Used in logs and metrics.

backends[].url

Base URL of the inference backend (e.g., http://vllm:8000).

backends[].engine

vllm | sglang | tensorrt-llm

Inference engine type. Used for tagging only.

backends[].weight

Weight used by the weighted_round_robin strategy. Backends with higher values receive more traffic.
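
One common way to turn weights into a request order is smooth weighted round-robin, sketched below. Whether Tensormux uses this exact variant is an assumption; the WeightedBackend type and nextBackend function are hypothetical names for illustration.

```typescript
// Hypothetical scheduler state for one backend.
interface WeightedBackend {
  name: string;
  weight: number;  // from backends[].weight
  current: number; // running score, starts at 0
}

// Smooth weighted round-robin: every backend gains its weight each round,
// the highest score wins, and the winner pays back the total weight.
function nextBackend(backends: WeightedBackend[]): WeightedBackend {
  if (backends.length === 0) throw new Error("no backends configured");
  let total = 0;
  let best = backends[0];
  for (const b of backends) {
    b.current += b.weight;
    total += b.weight;
    if (b.current > best.current) best = b;
  }
  best.current -= total;
  return best;
}
```

With the quickstart weights of 80 and 20, 100 consecutive calls select vllm-fast 80 times and sglang-cheap 20 times, interleaved rather than in long bursts.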

backends[].health_endpoint

HTTP path used for health checks. Defaults to /v1/models.

backends[].tags

List of string tags for labeling and filtering (e.g., region, GPU tier).

health.interval_s

Seconds between health check probes per backend.

health.fail_threshold

Consecutive failures before marking a backend unhealthy.

health.success_threshold

Consecutive successes before restoring a backend to healthy.
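
The two thresholds describe a small state machine per backend. The sketch below shows one plausible reading of them, assuming each counter resets when the opposite probe result arrives. The HealthState type and recordProbe function are hypothetical and not taken from the Tensormux source.

```typescript
// Hypothetical health-tracking state for one backend.
interface HealthState {
  healthy: boolean;
  consecutiveFailures: number;
  consecutiveSuccesses: number;
}

// Applies one probe result and returns the new state.
// failThreshold and successThreshold correspond to health.fail_threshold
// and health.success_threshold in the config.
function recordProbe(
  state: HealthState,
  ok: boolean,
  failThreshold: number,
  successThreshold: number
): HealthState {
  if (ok) {
    const successes = state.consecutiveSuccesses + 1;
    // An unhealthy backend is restored once enough probes succeed in a row.
    const healthy = state.healthy || successes >= successThreshold;
    return { healthy: healthy, consecutiveFailures: 0, consecutiveSuccesses: successes };
  }
  const failures = state.consecutiveFailures + 1;
  // A healthy backend is marked unhealthy once enough probes fail in a row.
  const healthy = state.healthy && failures < failThreshold;
  return { healthy: healthy, consecutiveFailures: failures, consecutiveSuccesses: 0 };
}
```

With the quickstart values (fail_threshold: 2, success_threshold: 1), a healthy backend survives one failed probe, is marked unhealthy on the second consecutive failure, and is restored by the next successful probe.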

logging.level

debug | info | warning | error

Log verbosity level.

logging.jsonl_path

File path for JSONL audit logs. Records every routed request with backend, latency, and status.
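
Because JSONL stores one JSON object per line, the audit log is easy to post-process. The sketch below averages latency per backend. The field names backend, latency_ms, and status are assumptions based on the description above; the real keys in tensormux.jsonl may differ, so check a line of your own log first.

```typescript
// Hypothetical record shape; the actual field names may differ.
interface RoutedRequest {
  backend: string;
  latency_ms: number;
  status: number;
}

// Takes the text of a JSONL file and returns the mean latency per backend.
function meanLatencyByBackend(jsonl: string): { [backend: string]: number } {
  const sums: { [backend: string]: { total: number; count: number } } = {};
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines, such as the trailing newline
    const rec = JSON.parse(line) as RoutedRequest;
    const entry = sums[rec.backend] || { total: 0, count: 0 };
    entry.total += rec.latency_ms;
    entry.count += 1;
    sums[rec.backend] = entry;
  }
  const means: { [backend: string]: number } = {};
  for (const name in sums) {
    means[name] = sums[name].total / sums[name].count;
  }
  return means;
}
```

In Node, the input string could come from reading the file at logging.jsonl_path with fs.readFileSync(path, "utf8").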

Full reference documentation

Source code, contributing guide, and full API docs are in the GitHub repository.

View on GitHub: https://github.com/KrxGu/Tensormux
