llm·monitor

Methodology · v2

How we measure

Every hour, on the half hour, we send the same prompt to every enabled endpoint, sequentially. Each result—successful or not—lands in a Postgres database, and the same database powers everything you see on the dashboard.

The prompt

"Explain theory of relativity simply"

With temperature: 0, max_tokens: 256, no system prompt, no streaming. Same prompt on every endpoint. We do one warmup run that's discarded, then one timed "hot" run that's recorded.

How tok/s is calculated

Wherever possible we use the inference engine's own timing—never wall-clock from the bench runner, which would conflate network latency with decode speed.

The numbers on each tile

Every endpoint tile shows three measurements over different windows. They are not the same number, and they're not supposed to be:

Status semantics

Each run gets exactly one status:

How non-ok runs appear visually

We make incidents visible rather than hiding them inside aggregate numbers:

Comparison strip

The "X is currently N% faster than Y" callout below the tiles uses the 7-day median of ok runs across all endpoints, picking the fastest and slowest. Different window from the per-tile numbers because comparisons need stability—a single bad hour shouldn't flip which endpoint is "fastest."

TITAN power figures

The “~800 W under load · ~170 W idle” numbers on the build page are read from a wall meter on the rig's power feed, not from nvidia-smi or summed GPU TDP. A wall reading captures the full system draw — GPUs, CPU, board, fans, drives, and the round-trip efficiency of both PSUs — which is what you'd actually pay for on a power bill.

“Under load” is the steady-state reading during a single-stream decode of the standard bench prompt; idle is the rig sitting at the OS prompt with vLLM loaded but no in-flight requests. A flat-out tensor-parallel job pushes higher than 800 W for short bursts, but the steady-state figure is what matters for sizing PSUs, breakers, and cooling.

Playground vs dashboard numbers

The /playground page warms both endpoints with a throwaway request before each visible race, so the tok/s you see there reflects steady-state decode speed only. The hourly bench probe behind the dashboard does not warm in the same way — it accepts cold-start latency as part of what it's measuring, because real client traffic will sometimes hit cold workers and that variability matters editorially.

Both numbers are correct. They just answer different questions: “how fast is this when it's running” versus “how consistent is this in the wild.”

What we deliberately don't do

Hardware reference

The primary local endpoint is TITAN: a 4× RTX 3090 build running vLLM 0.17 with tensor parallelism = 4. Full spec sheet linked. On May 14, 2026 the same 4× 3090 stack physically migrated from the HYDRA chassis (X299 Sage) to a new TITAN host (MEG X299 Creation) at 192.168.8.230. The migration is performance-neutral — TITAN benches at 75.62 tok/s, within run-to-run variance of the pre-migration baseline. Historical data prior to May 14 was recorded under the hydra-vllm-gemma4 endpoint name; the same row was renamed in place to titan-vllm-gemma4 rather than re-keyed, so the trend chart shows one continuous line across the rename.

The build page at /hardware/hydra remains the bill of materials for the rig — the URL kept its original slug as a historical artifact.

Revisions

We update this page when the methodology changes, never silently. If something looks wrong, the data is inspectable through the live dashboards.

Last revised when the 4×3090 rig migrated from HYDRA to TITAN on May 14, 2026.