Live benchmarks · since April 2026

Cloud LLM inference, measured against a real local rig — every hour, with the receipts.

We run the same prompt through Ollama Cloud and a 4×RTX 3090 vLLM build called HYDRA, every hour, and write the results to a public dashboard. Methodology is open. So is the source code.

Live · Endpoint Status

Decode tok/s · 7D trend

Latest from the lab

▸ vLLM TP=4

Apr 25, 2026 · 12 min

vLLM TP=4 on 4×RTX 3090: 76.9 tok/s, no marketing spin

A month of single-stream decode benchmarking on the HYDRA rig, including why we removed our NVLink bridges and got faster anyway.

▸ TP topology

Apr 22, 2026 · 8 min

TP must divide attention heads: debunking the 6×3090 myth

Gemma-4-31B has 32 attention heads. Here is why TP=5 silently breaks and what that means for your build budget.
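
A minimal sketch of the divisibility rule behind that post, assuming the only constraint is that the tensor-parallel size must evenly divide the attention-head count. The helper name `valid_tp_sizes` is hypothetical and is not part of vLLM; real engines may add further constraints (KV heads, vocabulary size) that this check ignores.

```python
def valid_tp_sizes(num_heads: int, max_gpus: int) -> list[int]:
    """Return every GPU count up to max_gpus that evenly divides num_heads."""
    return [tp for tp in range(1, max_gpus + 1) if num_heads % tp == 0]


# 32 is the head count quoted in the teaser above; 8 GPUs is an assumed ceiling.
NUM_HEADS = 32
print(valid_tp_sizes(NUM_HEADS, 8))  # [1, 2, 4, 8]
```

Under this assumption a fifth or sixth GPU adds nothing to a tensor-parallel group for a 32-head model: the next usable step after 4 cards is 8.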

▸ Cloud vs local

Apr 18, 2026 · 6 min

Ollama Cloud vs HYDRA: a head-to-head over 30 days

When the cloud wins, when local wins, and the cold-start signature you will not see in any vendor marketing material.

Build something like HYDRA · or rent the equivalent

If you don't want to spend $6,600 on used GPUs

Cloud-rented A100s and H100s sit between HYDRA and Ollama Cloud on price-per-token. Worth a look if your workload is bursty or you're testing before you build.

RunPod

Community cloud · per-second billing · zero egress fees

Get $10 credit →

Vast.ai

Spot GPU marketplace · cheapest H100s when available

Browse rigs →