llm·monitor

Live monitoring · since April 2026

Live GPU-cluster inference monitoring — every endpoint, every hour, with the receipts.

We probe every inference endpoint across the cluster once an hour: the 4×RTX 3090 vLLM rig (TITAN), the RTX 5090 (FORGE), the dual R9700 box (HYDRA-R), and SCOUT. Tokens/second, uptime, and incidents go to a public dashboard. Methodology is open.

Decode tok/s · 24H trend

Each point is one sample, taken at the top of the hour: one warmup run discarded, one timed run recorded. Same prompt every time. When an hour has no successful run, the line dives to the floor and a red dot marks the incident — timeout, rate-limit, or other non-OK status. We don't smooth incidents into the curve. Full methodology.
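For readers who want to reproduce the sampling rule, here is a minimal sketch of one hourly sample. It is not our production probe. The names `Sample`, `probe`, `plot_value`, and the `generate` callable (which sends the prompt and returns the number of generated tokens) are hypothetical. It also times the whole call, which is a simplification: a decode-only figure would exclude prefill time.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    endpoint: str
    tok_per_s: Optional[float]  # None when the hour has no successful run
    incident: Optional[str]     # e.g. "timeout"; None when the run was OK


def probe(endpoint: str,
          generate: Callable[[str], int],
          prompt: str,
          clock: Callable[[], float] = time.perf_counter) -> Sample:
    """Take one sample: discard a warmup run, then time a single run."""
    try:
        generate(prompt)             # warmup run, result discarded
        start = clock()
        n_tokens = generate(prompt)  # timed run, same prompt
        elapsed = clock() - start
    except TimeoutError:
        return Sample(endpoint, None, "timeout")
    except Exception as exc:         # any other non-OK outcome
        return Sample(endpoint, None, f"error: {type(exc).__name__}")
    if elapsed <= 0 or n_tokens <= 0:
        return Sample(endpoint, None, "error: empty run")
    return Sample(endpoint, n_tokens / elapsed, None)


def plot_value(sample: Sample, floor: float = 0.0) -> float:
    """Incidents are drawn at the floor, never interpolated away."""
    return floor if sample.tok_per_s is None else sample.tok_per_s
```

Keeping the failed hour as an explicit `Sample` with no tok/s value, rather than dropping it, is what lets the chart show the dive and the red dot instead of a smoothed line.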

Latest from the lab
▸ vLLM TP=4
Apr 25, 2026 · 12 min

vLLM TP=4 on 4×RTX 3090: 76.9 tok/s, no marketing spin

A month of single-stream decode benchmarking on the 4×3090 rig (then-named HYDRA, now TITAN), including why we removed our NVLink bridges and got faster anyway.

Build something like TITAN · or rent the equivalent

If you don't want to spend $6,200 on used GPUs

Cloud-rented A100s and H100s are a middle ground on price per token. They're worth a look if your workload is bursty or you're testing before you build.