Live benchmarks · since April 2026
Cloud LLM inference, measured against a real local rig — every hour, with the receipts.
We run the same prompt through Ollama Cloud and TITAN, a 4×RTX 3090 vLLM build (formerly HYDRA: same four 3090s, new chassis), once an hour, and write the results to a public dashboard. Methodology is open.
Each point is one sample, taken at the top of the hour: one warmup run discarded, one timed run recorded. Same prompt every time. When an hour has no successful run, the line dives to the floor and a red dot marks the incident — timeout, rate-limit, or other non-OK status. We don't smooth incidents into the curve. Full methodology.
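For readers who think in code, the sampling rule above can be sketched roughly as follows. This is a minimal illustration, not the dashboard's actual harness: `take_sample`, `Sample`, `RateLimited`, and the `run_prompt` callable (assumed to return the number of generated tokens) are hypothetical names. It also assumes throughput is tokens divided by wall-clock time for the timed run, and that a failed warmup counts as an incident.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    ok: bool                  # False marks an incident (the red dot)
    tokens_per_second: float  # 0.0 for incidents, so the line drops to the floor
    status: str               # "ok", "timeout", "rate_limit", or "error"


class RateLimited(Exception):
    """Hypothetical exception a backend client might raise when rate-limited."""


def take_sample(
    run_prompt: Callable[[str], int],
    prompt: str,
    clock: Callable[[], float] = time.perf_counter,
) -> Sample:
    """One warmup run (discarded), one timed run (recorded), failures kept as-is."""
    try:
        run_prompt(prompt)             # warmup: result is thrown away
        start = clock()
        n_tokens = run_prompt(prompt)  # timed run: the only one recorded
        elapsed = clock() - start
    except TimeoutError:
        return Sample(False, 0.0, "timeout")
    except RateLimited:
        return Sample(False, 0.0, "rate_limit")
    except Exception:
        return Sample(False, 0.0, "error")

    # Assumption: a run that produced no tokens or no measurable time is an incident.
    if elapsed <= 0 or n_tokens <= 0:
        return Sample(False, 0.0, "error")
    return Sample(True, n_tokens / elapsed, "ok")
```

A scheduler (cron, for instance) would call `take_sample` once at the top of each hour per backend and append the result unmodified. Incidents stay in the series as zero-throughput points rather than being averaged away.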
vLLM TP=4 on 4×RTX 3090: 76.9 tok/s, no marketing spin
A month of single-stream decode benchmarking on the 4×3090 rig (then-named HYDRA, now TITAN), including why we removed our NVLink bridges and got faster anyway.
If you don't want to spend $6,200 on used GPUs
Cloud-rented A100s and H100s sit between TITAN and Ollama Cloud on price-per-token. Worth a look if your workload is bursty or you're testing before you build.