ollama·watch

Live benchmarks · since April 2026

Cloud LLM inference, measured against a real local rig — every hour, with the receipts.

Every hour, we run the same prompt through Ollama Cloud and through TITAN, a 4×RTX 3090 vLLM build (formerly HYDRA: same four 3090s, new chassis), and write the results to a public dashboard. The methodology is open.

Live · Endpoint Status
Updated 5m ago · refresh hourly
gemma4-26b @ http://192.168.8.230:8002
OK · 5m ago
142.6 tok/s · last run
142.6 tok/s · 24h avg
±9.4% (±13.0 tok/s) · variance · 24h
66.7% uptime · 24h · vllm_metrics
4 incidents
gemma4-26b @ http://192.168.8.230:8001
OK · 5m ago
139.7 tok/s · last run
139.6 tok/s · 24h avg
±0.7% (±1.0 tok/s) · variance · 24h
36.4% uptime · 24h · vllm_metrics
14 incidents
whisper-large-v3 @ http://192.168.8.234:8002
DEGRADED · 5m ago
no data · last run
no data · 24h avg
no data · variance · 24h
0.0% uptime · 24h
qwen3-vl-8b @ http://192.168.8.234:8001
OK · 5m ago
138.5 tok/s · last run
138.7 tok/s · 24h avg
±0.2% (±0.2 tok/s) · variance · 24h
100.0% uptime · 24h · llamacpp_timings
gpt-oss:120b-cloud @ ollama-cloud (ollama)
OK · 5m ago
124.6 tok/s · last run
117.7 tok/s · 24h avg
±13.6% (±15.0 tok/s) · variance · 24h
86.4% uptime · 24h · server_wall
3 incidents
gemma4:31b-cloud @ ollama-cloud (ollama)
OK · 5m ago
113.2 tok/s · last run
87.6 tok/s · 24h avg
±41.1% (±34.9 tok/s) · variance · 24h
86.4% uptime · 24h · server_wall
3 incidents
glm-5.1:cloud @ ollama-cloud (ollama)
OK · 5m ago
69.0 tok/s · last run
88.7 tok/s · 24h avg
±37.4% (±35.4 tok/s) · variance · 24h
86.4% uptime · 24h · server_wall
3 incidents
deepseek-v4-flash:cloud @ ollama-cloud (ollama)
OK · 5m ago
101.0 tok/s · last run
64.1 tok/s · 24h avg
±31.6% (±22.7 tok/s) · variance · 24h
86.4% uptime · 24h · server_wall
3 incidents
kimi-k2.6:cloud @ ollama-cloud (ollama)
OK · 6m ago
110.9 tok/s · last run
51.5 tok/s · 24h avg
±45.0% (±28.2 tok/s) · variance · 24h
86.4% uptime · 24h · server_wall
3 incidents
gemma4-26b @ http://192.168.8.231:8001
OK · 6m ago
74.9 tok/s · last run
75.2 tok/s · 24h avg
±0.4% (±0.3 tok/s) · variance · 24h
100.0% uptime · 24h · llamacpp_timings
gemma4-26b @ http://192.168.8.231:8000
OK · 6m ago
75.1 tok/s · last run
75.5 tok/s · 24h avg
±0.2% (±0.2 tok/s) · variance · 24h
100.0% uptime · 24h · llamacpp_timings
gemma4-26b @ http://192.168.8.233:8001
OK · 6m ago
232.8 tok/s · last run
232.7 tok/s · 24h avg
±0.1% (±0.1 tok/s) · variance · 24h
100.0% uptime · 24h · vllm_metrics
llama3.2:3b @ apps-server1 (ollama)
OK · 6m ago
148.4 tok/s · last run
148.0 tok/s · 24h avg
±1.4% (±2.0 tok/s) · variance · 24h
100.0% uptime · 24h · engine
llava:7b @ apps-server1 (ollama)
OK · 6m ago
46.0 tok/s · last run
46.0 tok/s · 24h avg
±2.3% (±1.1 tok/s) · variance · 24h
100.0% uptime · 24h · engine
qwen2.5vl:7b @ apps-server1 (ollama)
OK · 7m ago
22.7 tok/s · last run
22.3 tok/s · 24h avg
±3.7% (±0.8 tok/s) · variance · 24h
100.0% uptime · 24h · engine
Decode tok/s · 24H trend

Each point is one sample, taken at the top of the hour: one warmup run discarded, one timed run recorded. Same prompt every time. When an hour has no successful run, the line dives to the floor and a red dot marks the incident — timeout, rate-limit, or other non-OK status. We don't smooth incidents into the curve. Full methodology.
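
For readers who want to reproduce the sampling rule described above, here is a minimal Python sketch, not the dashboard's actual code. The names `take_sample`, `summarize`, and `generate` are hypothetical. It assumes throughput is timed with a client-side wall clock (the cards above list other timing sources, such as vllm_metrics and llamacpp_timings), that uptime is the share of hourly samples that succeeded, and that the "variance" figure is a population standard deviation over the successful samples. None of those three definitions is confirmed by this page.

```python
import statistics
import time
from typing import Callable, Optional


def take_sample(
    generate: Callable[[], int],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return tok/s for one hourly sample, or None if the hour is an incident.

    `generate` is a hypothetical callable that sends the fixed prompt to one
    endpoint and returns the number of tokens it produced.
    """
    try:
        generate()  # warmup run, result discarded
        start = clock()
        tokens = generate()  # timed run, the only one recorded
        elapsed = clock() - start
    except Exception:
        # Timeout, rate limit, or any other non-OK status: keep the gap.
        return None
    if tokens <= 0 or elapsed <= 0:
        return None
    return tokens / elapsed


def summarize(samples: list[Optional[float]]) -> dict:
    """Summarize a window of hourly samples; None entries are incidents."""
    ok = [s for s in samples if s is not None]
    # Assumption: uptime = successful samples / all samples in the window.
    uptime_pct = 100.0 * len(ok) / len(samples) if samples else 0.0
    if not ok:
        return {"uptime_pct": uptime_pct, "avg": None,
                "spread": None, "spread_pct": None}
    avg = statistics.fmean(ok)
    # Assumption: the spread is a population standard deviation.
    spread = statistics.pstdev(ok)
    return {
        "uptime_pct": uptime_pct,
        "avg": avg,  # incidents are excluded, not averaged in as zeros
        "spread": spread,
        "spread_pct": 100.0 * spread / avg if avg else None,
    }
```

Incidents stay in the sample list as `None`, so they lower uptime without being smoothed into the average. Under this definition, 19 successful samples out of 22 works out to 86.4% uptime.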

Latest from the lab
▸ vLLM TP=4
Apr 25, 2026 · 12 min

vLLM TP=4 on 4×RTX 3090: 76.9 tok/s, no marketing spin

A month of single-stream decode benchmarking on the 4×3090 rig (then-named HYDRA, now TITAN), including why we removed our NVLink bridges and got faster anyway.

Build something like TITAN · or rent the equivalent

If you don't want to spend $6,200 on used GPUs

Cloud-rented A100s and H100s sit between TITAN and Ollama Cloud on price per token. They are worth a look if your workload is bursty or you want to test before you build.