# Tensormux Ecosystem > A comprehensive open-source toolkit and control plane for hosting, optimizing, and load-balancing LLM inference workloads. The Tensormux ecosystem consists of three main projects: 1. **Tensormux Gateway**: A high-performance L7 inference gateway for load-balancing and failover. 2. **TensorPath**: An inference optimization control plane and autonomous kernel optimization compiler (Forge). 3. **kernel-skills**: A curated library of expert-quality engineering playbooks that guide AI agents to write correct and performant compute kernels. --- ## 1. Tensormux Gateway **Repository**: `https://github.com/KrxGu/Tensormux` An open-source, self-hosted Layer 7 (L7) inference gateway written in Python (FastAPI/Uvicorn) that sits between your applications and multiple LLM backends (like vLLM, SGLang, TensorRT-LLM, or Triton). It presents a single, unified, OpenAI-compatible endpoint to manage routing, health, and failover seamlessly. ### Core Architecture & Mechanics - **API Compatibility**: Drop-in replacement for standard OpenAI client configurations (routes `/v1/chat/completions`, `/v1/models`, `/v1/embeddings`, etc.). - **Zero-Copy Streaming**: Forwards server-sent events (SSE) dynamically to client streams with zero-copy byte forwarding for minimum latency overhead. - **Observability**: Exposes a Prometheus metrics endpoint (`/metrics` for request counts, error rates, and latency histograms) and generates a structured JSONL audit trail of every request. - **Live UI Dashboard**: Provides a built-in status page at `/ui` to monitor backend health, active inflight requests, latencies, and routing logs in real time. ### Routing Strategies - `least_inflight`: Routes incoming traffic to the backend currently processing the fewest active concurrent requests. - `ewma_latency`: Calculates an Exponentially Weighted Moving Average (EWMA) of past request durations per backend, routing to the fastest responsive engine. - `weighted_round_robin`: Cycles requests through active backends proportional to pre-configured numerical weights. - `token_aware`: Routes requests based on input/output token estimates to optimize cache reuse. ### Health Probing & Failover - **Active Checking**: Periodically pings backend health endpoints (e.g., `/v1/models`) at a configurable interval. - **Passive Checking**: Monitors live request errors; backends exceeding the error threshold are automatically marked unhealthy and taken out of rotation. - **Automatic Recovery**: Once an unhealthy backend passes the required number of consecutive health probes, it is safely reintroduced to the routing pool. --- ## 2. TensorPath **Repository**: `https://github.com/tensormux/Tensorpath` The inference optimization control plane of the ecosystem. It acts as both a plan recommender for serving models and an autonomous kernel optimization compiler (Forge). ### Recommender System - **Plan Generation**: Scans available GPU hardware tiers, serving backends, and quantization configurations to build possible deployment candidates. - **Hard Constraints**: Excludes plans exceeding available GPU VRAM or defined monthly budget limits. - **Multi-Dimensional Scoring**: Evaluates and ranks remaining candidates on five dimensions: 1. **Latency**: Relative to target constraints. 2. **Throughput**: Tokens per second. 3. **Cost**: Hourly and monthly running costs. 4. **Quality**: Quantization quality loss factor. 5. **Simplicity**: Operational and deployment complexity (e.g., vLLM vs TensorRT-LLM). - **Plan Comparer**: Supports side-by-side comparison across different models (`/compare`). ### Forge (Kernel Optimization Layer) - **Playbook Integration**: Retrieves instruction sets directly from `@krxgu/kernel-skills` npm package. - **Prompt Generation**: Builds detailed 75 KB markdown prompts bundled with skill files for manual execution by coding agents. - **Autonomous Agentic Loop**: Integrates with Claude Opus 4.7. Forge provisions a tool-use environment enabling the agent to: - Write candidate Triton kernels. - Run correctness checks via `pytest` comparing output to a PyTorch reference. - Run benchmark cycles measuring performance against baseline Torch/Triton implementations. - Promote verified kernels achieving >= 1.10x speedup to the local kernel registry (`verified_kernels.json`). --- ## 3. kernel-skills **Repository**: `https://github.com/tensormux/kernel-skills` An open-source library of structured engineering playbooks (`SKILL.md` files) and tools designed for AI coding agents (such as Claude Code, Cursor, or ChatGPT) writing high-performance compute kernels. ### Why it Exists AI models often produce subpar kernel code when given vague prompts: they miss alignment constraints, ignore out-of-bounds safety, lack stable max-subtractions (for softmax), or use inefficient memory access. Each skill enforces a step-by-step playbook that guides the agent through constraint gathering, memory layout design, tile size selection, and validation. ### Curated Skill Categories - **CUDA**: - `write-cuda-gemm-kernel`: Shared memory tiling, double buffering, occupancy, and register budgeting. - `write-cuda-reduction-kernel`: Warp-shuffle parallel reduction tree with multi-block coordination. - `write-cuda-softmax-kernel`: Stable online softmax implementation with warp-level reductions. - `write-cuda-layernorm-kernel`: Welford online variance formula with fp32 accumulation. - **Triton**: - `write-triton-gemm-kernel`: Block tiling, `tl.dot` accumulation, and major-dimension selection. - `write-triton-softmax-kernel`: Reduction axis block sizing and variable sequence length masking. - `write-triton-layernorm-kernel`: Persistent LayerNorm kernels and backward pass accumulation. - `write-triton-attention-kernel`: Causal flash-attention loops, online softmax scaling, and GQA head mapping. - **Inference building blocks (LLaMA-family)**: - Triton kernels for `RMSNorm`, fused `residual-add + RMSNorm`, SwiGLU (`silu(a) * b`), Rotary Position Embeddings (GPT-NeoX vs GPT-J), decode-time sampling (temperature, top-k, top-p), KV cache append (contiguous/paged), and quantized weight dequantization. - **Quantization & Portability**: - `int8` (dp4a) and `fp8` ( Hopper/Ada dynamic scaling) kernels. - CUDA-to-Triton and CUDA-to-HIP translation playbooks. ### Tooling & CLI - **NPM Package**: `@krxgu/kernel-skills` - **CLI Commands**: - `kernel-skills list [--category ]`: Lists available playbooks. - `kernel-skills search `: Finds skills matching keywords. - `kernel-skills bundle `: Combines multiple skill files into a single optimized prompt block for agent ingestion. - **Programmatic API**: Exported JS/TS helpers (`searchSkills`, `getSkill`, `bundleSkills`) for automated agent workflows.