Methodology · v1
How we measure
Every hour, on the half hour, we send the same prompt to every enabled endpoint, sequentially. The results land in a public Postgres database, the same one that powers the dashboard on the home page.
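For the curious, the cadence boils down to something like the sketch below. This is illustrative only: the `bench_one` helper, the endpoint ids, the DSN, and the `runs` table are placeholders, not the project's real code or schema.

```python
# Sketch of the hourly cadence: wake at :30, hit each endpoint in turn,
# write one row per run. bench_one and the runs table are placeholders.
import datetime as dt
import time

import psycopg  # assumption: psycopg 3 for the Postgres writes

ENDPOINTS = ["hydra", "ollama-cloud"]  # hypothetical endpoint ids

def sleep_until_half_hour() -> None:
    """Block until the next :30 mark."""
    now = dt.datetime.now()
    target = now.replace(minute=30, second=0, microsecond=0)
    if target <= now:
        target += dt.timedelta(hours=1)
    time.sleep((target - now).total_seconds())

def bench_one(endpoint: str) -> dict:
    """Placeholder for the single-request benchmark described below."""
    return {"endpoint": endpoint, "status": "ok", "tok_per_s": 0.0}

def main() -> None:
    with psycopg.connect("dbname=bench") as conn:  # hypothetical DSN
        while True:
            sleep_until_half_hour()
            for endpoint in ENDPOINTS:  # sequentially, never in parallel
                row = bench_one(endpoint)
                conn.execute(
                    "INSERT INTO runs (endpoint, status, tok_per_s) VALUES (%s, %s, %s)",
                    (row["endpoint"], row["status"], row["tok_per_s"]),
                )
            conn.commit()
```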
The prompt
"Explain the theory of relativity in simple terms."
With temperature: 0, max_tokens: 256, no system prompt. Same for every endpoint.
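Against an Ollama-native endpoint that maps to a request roughly like the one below; the model name is illustrative, and `num_predict` is Ollama's name for `max_tokens`.

```python
import requests  # assumption: a plain non-streaming POST; the real runner may differ

payload = {
    "model": "llama3.1:8b",  # illustrative model name
    "prompt": "Explain the theory of relativity in simple terms.",
    "stream": False,
    "options": {"temperature": 0, "num_predict": 256},  # num_predict == max_tokens
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
body = resp.json()  # carries eval_count, eval_duration, total_duration used below
```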
How tok/s is calculated
Wherever possible we use the inference engine's own timing — never wall-clock from the bench runner, which would conflate network latency with decode speed.
- engine — Ollama's `eval_count / eval_duration` from the response. The gold standard.
- server_wall — when the upstream strips `eval_duration` (Ollama Cloud does this), we fall back to `eval_count / total_duration`. Still server-measured, just less granular.
- vllm_metrics — for vLLM, we snapshot `vllm:request_decode_time_seconds_sum` and `vllm:generation_tokens_total` before and after the run, then divide the deltas. Immune to network noise.
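In code, the three sources reduce to simple divisions. The sketch below assumes Ollama's nanosecond durations and vLLM's Prometheus text exposition format; the `/metrics` parsing is kept deliberately naive and is not the runner's actual implementation.

```python
def toks_from_engine(body: dict) -> float:
    """engine: Ollama reports eval_duration in nanoseconds."""
    return body["eval_count"] / (body["eval_duration"] / 1e9)

def toks_from_server_wall(body: dict) -> float:
    """server_wall: fall back to total_duration when eval_duration is stripped."""
    return body["eval_count"] / (body["total_duration"] / 1e9)

def toks_from_vllm_metrics(before: dict, after: dict) -> float:
    """vllm_metrics: divide the run's token delta by its decode-time delta."""
    tokens = after["vllm:generation_tokens_total"] - before["vllm:generation_tokens_total"]
    seconds = (after["vllm:request_decode_time_seconds_sum"]
               - before["vllm:request_decode_time_seconds_sum"])
    return tokens / seconds

def scrape_vllm(metrics_text: str) -> dict:
    """Naive parse of the two counters from a vLLM /metrics snapshot."""
    wanted = ("vllm:generation_tokens_total", "vllm:request_decode_time_seconds_sum")
    out = {}
    for line in metrics_text.splitlines():
        if line.startswith(wanted):
            name, value = line.rsplit(" ", 1)
            out[name.split("{")[0]] = float(value)
    return out
```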
Status semantics
Each run gets exactly one status: ok, degraded, rate_limited, provider_error, auth_failure, unreachable, or timeout. Uptime percentages on the dashboard are computed as ok ÷ total. Performance trends include ok runs only.
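A small illustration of those two rules; the status strings come straight from the list above, the input shapes are assumptions.

```python
OK = "ok"

def uptime_pct(statuses: list[str]) -> float:
    """Uptime = ok runs divided by all runs; every failure kind counts against it."""
    return 100.0 * statuses.count(OK) / len(statuses) if statuses else 0.0

def perf_samples(runs: list[dict]) -> list[float]:
    """Performance trends only ever include ok runs."""
    return [r["tok_per_s"] for r in runs if r["status"] == OK]
```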
What we deliberately don't do
- We don't run concurrency sweeps on the hourly job. This is a single-stream decode benchmark.
- We don't measure TTFT (time to first token) yet — adding that requires streaming, which the bench runner currently doesn't use. It's coming.
- We don't grade output quality. Same prompt, same temperature; the answer is irrelevant — only how fast it arrives.
- We don't bench from multiple geographic locations. All hits originate from a single LAN in the United States. Your latency may differ.
Hardware reference
The local endpoint is HYDRA: a 4× RTX 3090 build running vLLM 0.17 with tensor parallelism = 4. Full spec sheet linked.
We update this page when the methodology changes, never silently. If something looks wrong, the source for the entire pipeline lives on GitHub.