llm·monitor


Race TITAN against Ollama Cloud, live

Two endpoints, one model (gemma4-31B), one prompt. Both stream tokens at the same time. The dashboard reports that TITAN is roughly 3× faster on average; this page shows what "3× faster" actually feels like.

Same model, different infrastructure

Pick a prompt. We send it to TITAN (4×RTX 3090, vLLM, on-prem) and Ollama Cloud (managed) at the same moment, both running gemma4-31B. Both endpoints are warmed up before the race, so the comparison reflects steady-state speed, not cold-start latency.
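The warm-up-then-race flow described above can be sketched roughly as follows. This is a minimal illustration, not the playground's actual code: `fake_stream`, `timed_run`, and `race` are hypothetical names, and the fake stream only stands in for a real streaming endpoint, whose client API is not shown on this page.

```python
import asyncio
import time
from typing import AsyncIterator, Callable


async def fake_stream(n_tokens: int, delay_s: float) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming endpoint: one token every delay_s seconds."""
    for i in range(n_tokens):
        await asyncio.sleep(delay_s)
        yield f"tok{i}"


async def timed_run(name: str, make_stream: Callable[[], AsyncIterator[str]]) -> dict:
    """Consume one stream and report tokens, elapsed seconds, and tokens per second."""
    start = time.perf_counter()
    tokens = 0
    async for _ in make_stream():
        tokens += 1
    elapsed = time.perf_counter() - start
    return {
        "name": name,
        "tokens": tokens,
        "elapsed": elapsed,
        "tok_per_s": tokens / elapsed if elapsed > 0 else 0.0,
    }


async def race(endpoints: dict[str, Callable[[], AsyncIterator[str]]]) -> list[dict]:
    """Warm up every endpoint, then start all visible streams at the same moment."""
    # Throwaway warm-up request per endpoint; the results are discarded.
    await asyncio.gather(*(timed_run(n, m) for n, m in endpoints.items()))
    # Visible race: gather schedules all streams concurrently.
    return await asyncio.gather(*(timed_run(n, m) for n, m in endpoints.items()))


if __name__ == "__main__":
    # Delays here are arbitrary illustration values, not measurements.
    results = asyncio.run(race({
        "fast": lambda: fake_stream(10, 0.002),
        "slow": lambda: fake_stream(10, 0.02),
    }))
    for r in results:
        print(r["name"], r["tokens"], f"{r['tok_per_s']:.1f} tok/s")
```

Running both streams under one `asyncio.gather` call keeps the start times as close together as the event loop allows, which is the property the side-by-side comparison depends on.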

TITAN · gemma4 31B
4×RTX 3090 · vLLM · on-prem
Shows: tok/s, tokens, elapsed

Ollama Cloud · gemma4 31B
managed inference
Shows: tok/s, tokens, elapsed

Pick a prompt and click “Race”.

How this differs from the dashboard numbers

The hourly bench probe deliberately exposes cold-start variability: it sends each prompt without a separate warmup, so a freshly routed cloud worker can produce a slow run, and that slow run shows up in the trend chart.

The playground does the opposite: a quick throwaway request hits both endpoints first, so by the time the visible race starts, the model is loaded in GPU memory on both sides. What you see here is steady-state decode speed, not cold-start. Both numbers are real; they just answer different questions.
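To see why the two measurement styles can disagree, here is a small simulation. It is a sketch under stated assumptions: `SimulatedEndpoint` is a hypothetical model in which the first request pays a fixed load penalty, and the figures used (50 tok/s decode, 8 s cold start, 200 tokens) are invented for illustration, not measurements from TITAN or Ollama Cloud.

```python
from dataclasses import dataclass


@dataclass
class SimulatedEndpoint:
    """Hypothetical endpoint: the first request pays a one-time cold-start cost."""
    decode_tok_per_s: float
    cold_start_s: float
    loaded: bool = False

    def request(self, n_tokens: int) -> float:
        """Return simulated seconds taken to generate n_tokens."""
        elapsed = n_tokens / self.decode_tok_per_s
        if not self.loaded:
            elapsed += self.cold_start_s  # model load happens on first use
            self.loaded = True
        return elapsed


def probe_without_warmup(ep: SimulatedEndpoint, n_tokens: int) -> float:
    """Bench-probe style: time the first request, cold start included."""
    return n_tokens / ep.request(n_tokens)


def race_with_warmup(ep: SimulatedEndpoint, n_tokens: int) -> float:
    """Playground style: send a throwaway request, then time the next one."""
    ep.request(1)  # warm-up result is discarded
    return n_tokens / ep.request(n_tokens)


# Illustrative numbers only.
cold = probe_without_warmup(SimulatedEndpoint(50.0, 8.0), 200)
warm = race_with_warmup(SimulatedEndpoint(50.0, 8.0), 200)
print(f"no warmup: {cold:.1f} tok/s, with warmup: {warm:.1f} tok/s")
```

In this toy model the same endpoint reports a much lower tok/s when the cold start is counted, which is the kind of gap the trend chart can show and the playground deliberately hides.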