Methodology · v2
How we measure
Every hour, on the half hour, we send the same prompt to every enabled endpoint, sequentially. Each result—successful or not—lands in a Postgres database, and the same database powers everything you see on the dashboard.
The prompt
"Explain theory of relativity simply"
With temperature: 0, max_tokens: 256, no system prompt, no streaming. Same prompt on every endpoint. We do one warmup run that's discarded, then one timed "hot" run that's recorded.
How tok/s is calculated
Wherever possible we use the inference engine's own timing—never wall-clock from the bench runner, which would conflate network latency with decode speed.
- engine—Ollama's
eval_count / eval_durationfrom the response. The gold standard. - server_wall—when the upstream strips
eval_duration(Ollama Cloud does this), we fall back toeval_count / total_duration. Still server-measured, just less granular:total_durationincludes prompt evaluation, so the number is slightly conservative compared to pure decode. - vllm_metrics—for vLLM, we snapshot
vllm:request_decode_time_seconds_sumandvllm:generation_tokens_totalbefore and after the run, then divide the deltas. Immune to network noise.
The numbers on each tile
Every endpoint tile shows three measurements over different windows. They are not the same number, and they're not supposed to be:
- Last run—the tok/s from the most recent successful hourly probe (within the past hour or two). The headline number, but a single sample.
- 24h avg—the median tok/s across all
okruns in the last 24 hours. We use median rather than arithmetic mean because Ollama Cloud routing produces occasional outliers in both directions; median is more representative of typical performance. We label it "avg" because that's the term most readers expect. - Variance—coefficient of variation: sample standard deviation divided by mean, expressed as percent. Lower means more consistent. Below 5% is tight (green), 5–15% is typical for shared cloud infrastructure (amber), above 15% is jittery enough to matter for production agent loops (red). The bracketed value is the absolute standard deviation in tok/s for readers who prefer raw units.
Status semantics
Each run gets exactly one status:
ok—the endpoint returned the requested 256 tokens, with measurable timing. Counted in uptime, contributes to all averages and trend lines.timeout—the request didn't complete within the probe's HTTP timeout (currently 60 seconds for cloud endpoints, 30 for local). The endpoint may have eventually responded; we just stopped waiting. Drops uptime, ignored by averages, drawn as a red dashed tick on sparklines.rate_limited—upstream returned an HTTP 429. Drops uptime; not the endpoint's fault per se but does reflect what a real client would experience.provider_error—upstream returned 5xx, malformed JSON, or unexpected payload shape. Drops uptime.auth_failure—401/403. Almost always a stale credential on our side; we treat it as our problem, not the endpoint's, but still drops uptime.unreachable—TCP/DNS failure before any response. Drops uptime.degraded—reserved for future use; not yet emitted by the probe.
How non-ok runs appear visually
We make incidents visible rather than hiding them inside aggregate numbers:
- Tile sparkline—any non-
okrun becomes a vertical red dashed tick at its X position. The performance line breaks across the gap rather than connecting through it. - Trend chart—non-
okruns show as null values in the underlying data; a small red marker at the bucket position indicates that endpoint had at least one failed probe in that window. - Tiles' headline number—always pulled from the most recent
okrun, so an active incident shows in the status pill ("TIMEOUT", "RATE LIMITED") rather than in a missing tok/s number.
Comparison strip
The "X is currently N% faster than Y" callout below the tiles uses the 7-day median of ok runs across all endpoints, picking the fastest and slowest. Different window from the per-tile numbers because comparisons need stability—a single bad hour shouldn't flip which endpoint is "fastest."
TITAN power figures
The “~800 W under load · ~170 W idle” numbers on the build page are read from a wall meter on the rig's power feed, not from nvidia-smi or summed GPU TDP. A wall reading captures the full system draw — GPUs, CPU, board, fans, drives, and the round-trip efficiency of both PSUs — which is what you'd actually pay for on a power bill.
“Under load” is the steady-state reading during a single-stream decode of the standard bench prompt; idle is the rig sitting at the OS prompt with vLLM loaded but no in-flight requests. A flat-out tensor-parallel job pushes higher than 800 W for short bursts, but the steady-state figure is what matters for sizing PSUs, breakers, and cooling.
Playground vs dashboard numbers
The /playground page warms both endpoints with a throwaway request before each visible race, so the tok/s you see there reflects steady-state decode speed only. The hourly bench probe behind the dashboard does not warm in the same way — it accepts cold-start latency as part of what it's measuring, because real client traffic will sometimes hit cold workers and that variability matters editorially.
Both numbers are correct. They just answer different questions: “how fast is this when it's running” versus “how consistent is this in the wild.”
What we deliberately don't do
- We don't run concurrency sweeps on the hourly job. This is a single-stream decode benchmark. Concurrency profiles (using GuideLLM-style methodology) are a separate roadmap item.
- We don't measure TTFT yet—adding that requires streaming, which the bench runner currently doesn't use. It's coming.
- We don't grade output quality. Same prompt, same temperature; the answer text is irrelevant—only how fast the tokens arrive.
- We don't bench from multiple geographic locations. All probes originate from a single LAN in the United States. Your latency may differ.
- We don't retry failed runs. A timeout is a timeout: that's the data point.
Hardware reference
The primary local endpoint is TITAN: a 4× RTX 3090 build running vLLM 0.17 with tensor parallelism = 4. Full spec sheet linked. On May 14, 2026 the same 4× 3090 stack physically migrated from the HYDRA chassis (X299 Sage) to a new TITAN host (MEG X299 Creation) at 192.168.8.230. The migration is performance-neutral — TITAN benches at 75.62 tok/s, within run-to-run variance of the pre-migration baseline. Historical data prior to May 14 was recorded under the hydra-vllm-gemma4 endpoint name; the same row was renamed in place to titan-vllm-gemma4 rather than re-keyed, so the trend chart shows one continuous line across the rename.
The build page at /hardware/hydra remains the bill of materials for the rig — the URL kept its original slug as a historical artifact.
Revisions
We update this page when the methodology changes, never silently. If something looks wrong, the data is inspectable through the live dashboards.
Last revised when the 4×3090 rig migrated from HYDRA to TITAN on May 14, 2026.