# Tensormux Ecosystem

> A comprehensive open-source toolkit and control plane for hosting, optimizing, and load-balancing LLM inference workloads.

The Tensormux ecosystem consists of three main projects:
1. **Tensormux Gateway**: A high-performance L7 inference gateway for load-balancing and failover.
2. **TensorPath**: An inference optimization control plane and autonomous kernel optimization compiler (Forge).
3. **kernel-skills**: A curated library of expert-quality engineering playbooks that guide AI agents to write correct and performant compute kernels.

---

## 1. Tensormux Gateway

**Repository**: `https://github.com/KrxGu/Tensormux`

An open-source, self-hosted Layer 7 (L7) inference gateway written in Python (FastAPI/Uvicorn) that sits between your applications and multiple LLM backends (like vLLM, SGLang, TensorRT-LLM, or Triton). It presents a single, unified, OpenAI-compatible endpoint to manage routing, health, and failover seamlessly.

### Core Architecture & Mechanics
- **API Compatibility**: Drop-in replacement for standard OpenAI client configurations (routes `/v1/chat/completions`, `/v1/models`, `/v1/embeddings`, etc.).
- **Zero-Copy Streaming**: Forwards server-sent events (SSE) dynamically to client streams with zero-copy byte forwarding for minimum latency overhead.
- **Observability**: Exposes a Prometheus metrics endpoint (`/metrics` for request counts, error rates, and latency histograms) and generates a structured JSONL audit trail of every request.
- **Live UI Dashboard**: Provides a built-in status page at `/ui` to monitor backend health, active inflight requests, latencies, and routing logs in real time.

### Routing Strategies
- `least_inflight`: Routes incoming traffic to the backend currently processing the fewest active concurrent requests.
- `ewma_latency`: Calculates an Exponentially Weighted Moving Average (EWMA) of past request durations per backend, routing to the fastest responsive engine.
- `weighted_round_robin`: Cycles requests through active backends proportional to pre-configured numerical weights.
- `token_aware`: Routes requests based on input/output token estimates to optimize cache reuse.

### Health Probing & Failover
- **Active Checking**: Periodically pings backend health endpoints (e.g., `/v1/models`) at a configurable interval.
- **Passive Checking**: Monitors live request errors; backends exceeding the error threshold are automatically marked unhealthy and taken out of rotation.
- **Automatic Recovery**: Once an unhealthy backend passes the required number of consecutive health probes, it is safely reintroduced to the routing pool.

---

## 2. TensorPath

**Repository**: `https://github.com/tensormux/Tensorpath`

The inference optimization control plane of the ecosystem. It acts as both a plan recommender for serving models and an autonomous kernel optimization compiler (Forge).

### Recommender System
- **Plan Generation**: Scans available GPU hardware tiers, serving backends, and quantization configurations to build possible deployment candidates.
- **Hard Constraints**: Excludes plans exceeding available GPU VRAM or defined monthly budget limits.
- **Multi-Dimensional Scoring**: Evaluates and ranks remaining candidates on five dimensions:
  1. **Latency**: Relative to target constraints.
  2. **Throughput**: Tokens per second.
  3. **Cost**: Hourly and monthly running costs.
  4. **Quality**: Quantization quality loss factor.
  5. **Simplicity**: Operational and deployment complexity (e.g., vLLM vs TensorRT-LLM).
- **Plan Comparer**: Supports side-by-side comparison across different models (`/compare`).

### Forge (Kernel Optimization Layer)
- **Playbook Integration**: Retrieves instruction sets directly from `@krxgu/kernel-skills` npm package.
- **Prompt Generation**: Builds detailed 75 KB markdown prompts bundled with skill files for manual execution by coding agents.
- **Autonomous Agentic Loop**: Integrates with Claude Opus 4.7. Forge provisions a tool-use environment enabling the agent to:
  - Write candidate Triton kernels.
  - Run correctness checks via `pytest` comparing output to a PyTorch reference.
  - Run benchmark cycles measuring performance against baseline Torch/Triton implementations.
  - Promote verified kernels achieving >= 1.10x speedup to the local kernel registry (`verified_kernels.json`).

---

## 3. kernel-skills

**Repository**: `https://github.com/tensormux/kernel-skills`

An open-source library of structured engineering playbooks (`SKILL.md` files) and tools designed for AI coding agents (such as Claude Code, Cursor, or ChatGPT) writing high-performance compute kernels.

### Why it Exists
AI models often produce subpar kernel code when given vague prompts: they miss alignment constraints, ignore out-of-bounds safety, lack stable max-subtractions (for softmax), or use inefficient memory access. Each skill enforces a step-by-step playbook that guides the agent through constraint gathering, memory layout design, tile size selection, and validation.

### Curated Skill Categories
- **CUDA**:
  - `write-cuda-gemm-kernel`: Shared memory tiling, double buffering, occupancy, and register budgeting.
  - `write-cuda-reduction-kernel`: Warp-shuffle parallel reduction tree with multi-block coordination.
  - `write-cuda-softmax-kernel`: Stable online softmax implementation with warp-level reductions.
  - `write-cuda-layernorm-kernel`: Welford online variance formula with fp32 accumulation.
- **Triton**:
  - `write-triton-gemm-kernel`: Block tiling, `tl.dot` accumulation, and major-dimension selection.
  - `write-triton-softmax-kernel`: Reduction axis block sizing and variable sequence length masking.
  - `write-triton-layernorm-kernel`: Persistent LayerNorm kernels and backward pass accumulation.
  - `write-triton-attention-kernel`: Causal flash-attention loops, online softmax scaling, and GQA head mapping.
- **Inference building blocks (LLaMA-family)**:
  - Triton kernels for `RMSNorm`, fused `residual-add + RMSNorm`, SwiGLU (`silu(a) * b`), Rotary Position Embeddings (GPT-NeoX vs GPT-J), decode-time sampling (temperature, top-k, top-p), KV cache append (contiguous/paged), and quantized weight dequantization.
- **Quantization & Portability**:
  - `int8` (dp4a) and `fp8` ( Hopper/Ada dynamic scaling) kernels.
  - CUDA-to-Triton and CUDA-to-HIP translation playbooks.

### Tooling & CLI
- **NPM Package**: `@krxgu/kernel-skills`
- **CLI Commands**:
  - `kernel-skills list [--category <name>]`: Lists available playbooks.
  - `kernel-skills search <query>`: Finds skills matching keywords.
  - `kernel-skills bundle <skill-ids>`: Combines multiple skill files into a single optimized prompt block for agent ingestion.
- **Programmatic API**: Exported JS/TS helpers (`searchSkills`, `getSkill`, `bundleSkills`) for automated agent workflows.