Methodology · v1

How we measure

Every hour, on the half hour, we send the same prompt to every enabled endpoint, sequentially. The results land in a public Postgres database. The same database powers the dashboard you saw on the home page.

The prompt

"Explain the theory of relativity in simple terms."

With temperature: 0, max_tokens: 256, no system prompt. Same for every endpoint.

How tok/s is calculated

Wherever possible we use the inference engine's own timing — never wall-clock from the bench runner, which would conflate network latency with decode speed.

engine — Ollama's eval_count / eval_duration from the response. The gold standard.
server_wall — when the upstream strips eval_duration (Ollama Cloud does this), we fall back to eval_count / total_duration. Still server-measured, just less granular.
vllm_metrics — for vLLM, we snapshot vllm:request_decode_time_seconds_sum and vllm:generation_tokens_total before and after the run, then divide the deltas. Immune to network noise.

Status semantics

Each run gets exactly one of: ok, degraded, rate_limited, provider_error, auth_failure, unreachable, timeout. Uptime percentages on the dashboard count ok÷total. Performance trends include ok only.

What we deliberately don't do

We don't run concurrency sweeps on the hourly job. This is a single-stream decode benchmark.
We don't measure TTFT yet — adding that requires streaming, which the bench runner currently doesn't use. It's coming.
We don't grade output quality. Same prompt, same temperature; the answer is irrelevant — only how fast it arrives.
We don't bench from multiple geographic locations. All hits originate from a single LAN in the United States. Your latency may differ.

Hardware reference

The local endpoint is HYDRA: a 4× RTX 3090 build running vLLM 0.17 with tensor parallelism = 4. Full spec sheet linked.

We update this page when the methodology changes, never silently. If something looks wrong, the source for the entire pipeline lives on GitHub.