ollama·watch

Methodology · v1

How we measure

Every hour, on the half hour, we send the same prompt to every enabled endpoint, sequentially. The results land in a public Postgres database, the same one that powers the dashboard on the home page.
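
For concreteness, here is a minimal sketch of what one measurement cycle could look like. The endpoint list, table layout, probe helper, and connection string are all illustrative assumptions, not the production code; the real pipeline is in the GitHub repository linked at the bottom of this page.

```python
# Sketch of one measurement cycle. ENDPOINTS, probe, and bench_runs
# are placeholder names, not the production pipeline.
import psycopg2

ENDPOINTS = ["http://hydra:11434", "https://api.example.com"]  # assumed list

def probe(base_url: str) -> dict:
    """Placeholder; the real probe sends the fixed prompt (next section)."""
    return {"status": "ok", "tok_s": None}

def run_cycle(conn) -> None:
    # Endpoints are probed one at a time, never concurrently, so the
    # runner's own load cannot skew an individual measurement (our
    # reading of "sequentially" above).
    for base_url in ENDPOINTS:
        result = probe(base_url)
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO bench_runs (endpoint, status, tok_s, ran_at) "
                "VALUES (%s, %s, %s, now())",
                (base_url, result["status"], result["tok_s"]),
            )
        conn.commit()

if __name__ == "__main__":
    run_cycle(psycopg2.connect("dbname=ollamawatch"))  # assumed DSN
```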

The prompt

"Explain the theory of relativity in simple terms."

We send it with temperature: 0, max_tokens: 256, and no system prompt, identical for every endpoint.
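
For an Ollama-style endpoint, the request could look like the sketch below. Note that Ollama spells the token cap num_predict rather than max_tokens; the model argument and the timeout value are placeholder assumptions.

```python
import requests

PROMPT = "Explain the theory of relativity in simple terms."

def send_prompt(base_url: str, model: str) -> dict:
    """POST the fixed prompt to an Ollama-style /api/generate endpoint."""
    body = {
        "model": model,
        "prompt": PROMPT,
        "stream": False,
        "options": {
            "temperature": 0,    # deterministic decoding
            "num_predict": 256,  # Ollama's name for max_tokens
        },
        # deliberately no system prompt
    }
    resp = requests.post(f"{base_url}/api/generate", json=body, timeout=120)
    resp.raise_for_status()
    return resp.json()
```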

How tok/s is calculated

Wherever possible we compute tok/s from the inference engine's own reported timing, never from wall-clock time in the bench runner, which would conflate network latency with decode speed.
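
Ollama, for example, returns eval_count (tokens generated) and eval_duration (nanoseconds) in its response body, so decode speed can be computed entirely from engine-reported numbers; other engines expose similar counters under different names.

```python
def decode_speed(resp: dict) -> float | None:
    """tok/s from the engine's own counters (Ollama field names shown)."""
    count = resp.get("eval_count")           # tokens generated
    duration_ns = resp.get("eval_duration")  # generation time, nanoseconds
    if not count or not duration_ns:
        return None  # engine timing unavailable for this run
    return count / (duration_ns / 1e9)
```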

Status semantics

Each run gets exactly one status: ok, degraded, rate_limited, provider_error, auth_failure, unreachable, or timeout. Uptime percentages on the dashboard are ok ÷ total runs; performance trends include only runs marked ok.
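
The exact classification rules live in the pipeline source; the sketch below only illustrates the shape of the mapping, and every HTTP-code rule and threshold in it is an assumption.

```python
from enum import Enum
import requests

class RunStatus(str, Enum):
    OK = "ok"
    DEGRADED = "degraded"
    RATE_LIMITED = "rate_limited"
    PROVIDER_ERROR = "provider_error"
    AUTH_FAILURE = "auth_failure"
    UNREACHABLE = "unreachable"
    TIMEOUT = "timeout"

def classify(exc: Exception | None,
             http_status: int | None,
             tok_s: float | None) -> RunStatus:
    # Assumed mapping for illustration only.
    if isinstance(exc, requests.exceptions.Timeout):
        return RunStatus.TIMEOUT
    if isinstance(exc, requests.exceptions.ConnectionError):
        return RunStatus.UNREACHABLE
    if http_status == 429:
        return RunStatus.RATE_LIMITED
    if http_status in (401, 403):
        return RunStatus.AUTH_FAILURE
    if http_status is not None and http_status >= 500:
        return RunStatus.PROVIDER_ERROR
    if tok_s is not None and tok_s < 1.0:  # assumed "degraded" threshold
        return RunStatus.DEGRADED
    return RunStatus.OK
```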

What we deliberately don't do

Hardware reference

The local endpoint is HYDRA: a 4× RTX 3090 build running vLLM 0.17 with tensor parallelism = 4. Full spec sheet linked.

We update this page when the methodology changes, never silently. If something looks wrong, the source for the entire pipeline lives on GitHub.