Live benchmarks · since April 2026
Cloud LLM inference, measured against a real local rig — every hour, with the receipts.
We run the same prompt through Ollama Cloud and a 4×RTX 3090 vLLM build called HYDRA, every hour, and write the results to a public dashboard. Methodology is open. So is the source code.
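The hourly run reduces to timing one completion against each backend and dividing tokens generated by wall-clock time. A minimal sketch of that loop, assuming both backends expose an OpenAI-compatible `/v1/completions` endpoint (the URLs, model name, and prompt here are placeholders, not the dashboard's actual config):

```python
import json
import time
import urllib.request

def decode_tps(completion_tokens: int, t_first_token: float, t_done: float) -> float:
    """Single-stream decode throughput: tokens emitted after the first one,
    divided by the time spent emitting them (i.e. excluding prefill)."""
    if t_done <= t_first_token or completion_tokens < 2:
        raise ValueError("need at least two tokens and positive decode time")
    return (completion_tokens - 1) / (t_done - t_first_token)

def run_once(base_url: str, model: str, prompt: str) -> float:
    """One benchmark sample against a hypothetical OpenAI-compatible endpoint.
    Non-streaming, so this measures end-to-end tok/s; a streamed variant
    would timestamp the first chunk and use decode_tps() instead."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps({"model": model, "prompt": prompt,
                         "max_tokens": 256, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    t0 = time.monotonic()
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    t1 = time.monotonic()
    return body["usage"]["completion_tokens"] / (t1 - t0)
```

The real rig separates prefill from decode by streaming, which is why `decode_tps` drops the first token; end-to-end numbers flatter whichever backend has the faster prefill.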
No bench data yet — check back after the first hourly run.
no data yet
No successful runs in the selected range.
Latest from the lab
▸ vLLM TP=4
Apr 25, 2026 · 12 min
vLLM TP=4 on 4×RTX 3090: 76.9 tok/s, no marketing spin
A month of single-stream decode benchmarking on the HYDRA rig, including why we removed our NVLink bridges and got faster anyway.
▸ TP topology
Apr 22, 2026 · 8 min
TP must divide attention heads: debunking the 6×3090 myth
Gemma-4-31B has 32 attention heads. Here is why TP=6 silently breaks and what that means for your build budget.
▸ Cloud vs local
Apr 18, 2026 · 6 min
Ollama Cloud vs HYDRA: a head-to-head over 30 days
When the cloud wins, when local wins, and the cold-start signature you will not see in any vendor marketing material.
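The TP constraint teased in the second post above comes down to a divisibility check: tensor parallelism shards attention heads across GPUs, so the head count must be a multiple of the TP degree. A sketch, using the 32-head count the post cites:

```python
def valid_tp_degrees(num_attention_heads: int) -> list[int]:
    """TP degrees that evenly shard the attention heads across GPUs.
    vLLM, like most TP implementations, rejects anything else."""
    return [tp for tp in range(1, num_attention_heads + 1)
            if num_attention_heads % tp == 0]

# With 32 attention heads, only the divisors of 32 are usable:
print(valid_tp_degrees(32))  # [1, 2, 4, 8, 16, 32]
# Neither 5 nor 6 is on that list, so a 6-GPU build cannot run
# this model at TP=6 no matter how much VRAM it has.
```

This is why GPU count alone is a misleading spec: a 6×3090 rig serving a 32-head model falls back to TP=4 and leaves two cards idle for that model.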
Build something like HYDRA · or rent the equivalent
If you don't want to spend $6,600 on used GPUs
Cloud-rented A100s and H100s sit between HYDRA and Ollama Cloud on price-per-token. Worth a look if your workload is bursty or you're testing before you build.