Live monitoring · since April 2026
Live GPU-cluster inference monitoring — every endpoint, every hour, with the receipts.
We probe every inference endpoint across the cluster — the 4×RTX 3090 vLLM rig (TITAN), the RTX 5090 (FORGE), the dual R9700 box (HYDRA-R), and SCOUT — once an hour, and write tokens/second, uptime, and incidents to a public dashboard. Methodology is open.
Each point is one sample, taken at the top of the hour: one warmup run discarded, one timed run recorded. Same prompt every time. When an hour has no successful run, the line dives to the floor and a red dot marks the incident — timeout, rate-limit, or other non-OK status. We don't smooth incidents into the curve. Full methodology.
vLLM TP=4 on 4×RTX 3090: 76.9 tok/s, no marketing spin
A month of single-stream decode benchmarking on the 4×3090 rig (then-named HYDRA, now TITAN), including why we removed our NVLink bridges and got faster anyway.
If you don't want to spend $6,200 on used GPUs
Cloud-rented A100s and H100s are a middle ground on price-per-token. Worth a look if your workload is bursty or you're testing before you build.