▸ vLLM TP=4 · April 25, 2026 · 12 min read

vLLM TP=4 on 4×RTX 3090: 76.9 tok/s, no marketing spin

HYDRA is four used RTX 3090s in a single workstation, running Gemma-4-31B at AWQ-4bit through vLLM 0.19.1 with tensor parallelism set to four. The headline number is 76.9 tok/s single-stream decode, measured by vLLM's own request-level metrics, averaged across hundreds of hot runs over the past month. This post is the one I wanted to read before I built the rig — what scales, what doesn't, what I tried that didn't work, and where this build actually earns its keep against Ollama Cloud.

Numbers below come from the same probe that feeds the live dashboard. Methodology is on its own page; the short version is one warmup run discarded, one timed hot run recorded, hourly, with engine-reported timing rather than wall-clock from the bench host.

The build, briefly

Four RTX 3090s — uniform Ampere, no 3090 Ti, no Blackwell, no NVLink bridges. 96 GB total VRAM (4 × 24). Top pair sits on PCIe Gen3/4 x16, bottom pair on x8 due to motherboard lane limits. Pure SM86 — no FP8 hardware path, no MXFP4 path. That matters; it ruled out half the optimization paths I went looking for. We arrived at this configuration by subtraction: an earlier HYDRA run had a 3090 Ti and two NVLink bridges, both of which got removed once the data showed they weren't paying their way (more on the bridges below).

The full bill of materials lives on the HYDRA build page, with live eBay/Amazon prices for replacement parts.

The number, three ways

Decode throughput on Gemma-4-31B-it AWQ at 32K context, batch=1, single-stream, temperature=0:

hot speed-bench (back-to-back warm runs)   81.35 tok/s
production rolling 7-day median            76.9 tok/s
NVLink-removed re-bench                    75.4 tok/s

The 81 number is the speed-bench: warm runs back-to-back on a fresh server, prompt cached, no concurrent requests. The 76.9 is what real production-style hourly probing shows over a month — slightly lower because every hour the model is hit cold-ish (vLLM keeps weights resident, but the KV cache and attention paging state churn). The 75.4 is the same harness immediately after the NVLink bridges were physically pulled. That gap is real but inside run-to-run noise; more on that below.

What TP=4 actually buys you

Going from TP=2 (two GPUs on the upper pair) to TP=4 (all four GPUs, AWQ Marlin):

Gemma-4-31B AWQ        TP=2: 63.31    TP=4: 81.35    +28.5%
Qwen3.6-27B GPTQ-Pro   TP=2: 72.18    TP=4: 84.17    +16.6%

Two things stand out. First, you don't get 2× from doubling the GPU count — you get 17–29%. Tensor parallelism splits weight matrices across devices, but each forward pass needs an all-reduce to recombine partial sums, and that synchronization eats into the gain. Second, the dense Gemma scales better than the hybrid Qwen3.6 — Qwen3.6 has 16 standard attention layers and 48 SSM (state-space) layers, and the SSM layers don't tensor-parallelize as cleanly, so adding cards helps less.

For pure dense transformer decode at single-stream batch=1, the marginal value of card #3 and #4 is real but rapidly diminishing. If you're sizing a build for inference only and don't need the VRAM to fit a bigger model, two 3090s on the same pair get you ~78% of the way there for half the GPU cost.

Eager mode is a 6× regression

vLLM has a --enforce-eager flag that disables CUDA graph capture. People reach for it when they hit obscure hangs or want lower memory pressure during dev. Don't ship it:

Qwen3.6-27B  TP=2  Marlin  graphs on:    72.18 tok/s
Qwen3.6-27B  TP=2  Marlin  --enforce-eager: 11.74 tok/s

Eager mode is 6.1× slower on this hybrid architecture. CUDA graphs batch tens to hundreds of small kernel launches into a single GPU-side replay; without them you pay full kernel-launch latency on every layer, every token. On a transformer with 60+ layers and a few hundred output tokens, that's tens of thousands of unnecessary CPU-GPU round trips per request. If a tutorial tells you to set --enforce-eager, close the tab.

NVLink, ablated

Two PNY 3-slot NVLink bridges were on the rig at one point. They contributed roughly nothing on this workload, so they came out:

with bridges    78.6 tok/s   (vllm:request_decode_time, locked baseline)
no bridges      75.4 tok/s   (-4.1%, within run-to-run noise)

The 4% gap is well inside what we see across hot runs on the same physical config — meaning if you ran the test twice with bridges installed both times, you'd sometimes see a 4% spread between runs. The bridges weren't earning their keep here.

Why not? Three reasons compound. AWQ-4bit weights mean the all-reduce payloads are tiny — they fit comfortably inside PCIe Gen3/4 x16 bandwidth without saturating it. CUDA graphs batch many ops between the synchronization points, so any per-call latency advantage NVLink might have gets amortized to near-zero. And the cross-pair traffic (between bridge A and bridge B) is PCIe-only no matter what; bridges only help within a pair.

Bridges would still matter for BF16 / FP16 full-precision models (~10× the activation traffic), for pure TP=2 with no cross-pair PCIe bottleneck, and for training where gradient all-reduces are heavier than inference. For AWQ-4bit single-stream serving, they're decoration. We pulled them and kept them out — and recommend you don't bother sourcing them for this workload class.

Why we landed on AWQ-4bit

Quant matrix for Gemma-4-31B on Ampere SM86, ranked by what we actually shipped:

AWQ-4bit       17 GB   Marlin path           ✅ shipped
GPTQ-4bit      17 GB   Marlin path           tied — no reason to switch
INT4 AutoRound 19 GB   slower kernels        no
FP8 (W8A8)     31 GB   Ampere has no FP8     ❌ won't run
NVFP4          19 GB   Blackwell-only        ❌ wrong arch
BF16 full      59 GB   would fit TP=4 tight  no — slower, no quality win at this size
Q4_K_M (GGUF)  18 GB   GGUF in vLLM is slow  no — 5–10× slower than Marlin paths

Verdict: AWQ-4bit is the right answer for SM86 right now. Not because it's the newest format — it isn't — but because it's the format with the fastest production kernel (Marlin) on the hardware we have. There is no upside trade available without changing the hardware. Anyone telling you to "just use FP8" hasn't checked whether your GPUs actually support it.

KV cache quantization: a dead end on Ampere

Long-context inference burns VRAM on the KV cache, not the weights. We spent real time trying to shrink it. Status report:

--kv-cache-dtype fp8_e4m3      ❌ fails on Ampere (no FP8 HW)
--kv-cache-dtype fp8_e5m2      ⚠️  gated on AWQ checkpoints
--kv-cache-dtype int8          ❌ not implemented (vLLM #33480)
TurboQuant (vLLM 0.20+)        ⚠️  trades 50–65% decode for KV pool
Sliding-window KV optimization ❌ not implemented (vLLM #39133)

None of these is a viable production path on this hardware today. TurboQuant is the closest to working, but the throughput cost is severe enough that it's only a win if your workload is KV-bound rather than decode-bound — which most agent loops aren't. We left the KV cache in BF16 and accepted the VRAM cost; if it ever becomes the binding constraint, the fix is more cards, not a different KV format.

The 6×3090 myth

A reasonable person looks at "TP=4 gave us +28% over TP=2" and asks: would TP=5 or TP=6 give us another 15–25%? Tempting. Cheaper used 3090s plus a bigger PSU vs. the cost of moving to a new architecture.

It doesn't work, and the reason is structural: TP must evenly divide the model's attention head count. Gemma-4-31B has 32 attention heads. 32 / 4 = 8 heads per GPU, clean. 32 / 5 = 6.4, not clean — TP=5 silently fails or refuses to start. 32 / 6 = 5.33, same. 32 / 8 = 4, would work, but you'd need eight cards.

Qwen3.6-27B has 16 attention heads, so on Qwen the valid TP rungs are 1, 2, 4, 8, 16. Other models have other counts. The point is the GPU count isn't the choice — the head divisibility is the choice, and the GPU count follows.

What this means in practice: if you're shopping the used GPU market and you see a deal on a fifth 3090, walk away. It won't help vLLM. It would help Ollama (which uses sequential layer-split, not TP, and doesn't care about head counts) — but Ollama is materially slower than vLLM on the same hardware (see the next section), so you'd be paying for a card that only helps the slower runtime. The right move is to make TP=4 work harder, not to add card #5.

vLLM vs Ollama on the same rig

Same hardware, same model, two runtimes. Single-stream Gemma-4-31B decode, hot:

vLLM 0.19.1   TP=4   AWQ-4bit Marlin    76.9 tok/s
Ollama 0.21.2 layer-split              ~41 tok/s  (32K ctx)

vLLM is roughly 1.9× faster on this build for the same model. The reason is real tensor parallelism — Ollama splits layers across GPUs and walks them sequentially per token, so the chain runs at the speed of the slowest link. vLLM splits each layer's weight matrices across all four GPUs and recombines, so all four cards do work on every token in parallel.

The trade is that vLLM is more brittle to operate. Head-count divisibility, CUDA graph quirks, AWQ checkpoint gating, model-serve-name-vs-root-path confusion. Ollama is dramatically friendlier — pull the model, start serving, done. For experimentation, Ollama. For production single-stream decode where every tok/s matters, vLLM.

vLLM vs Ollama Cloud

Same model (gemma4-31b), local TP=4 vs. managed cloud, over a 7-day rolling window pulled live from the dashboard's database:

hydra-vllm-gemma4               73.7 tok/s median   ±1% CV
ollama-cloud-gemma4-31b         33.8 tok/s median   ~19× range
                                                    (2.9 to 54.7)

HYDRA is around 2.2× faster on the same model with dramatically tighter run-to-run consistency. The cloud number's range — 2.9 to 54.7 tok/s across the 7 days — reflects cold-start behavior at the edge: when traffic to a given model is intermittent, the cloud's worker pool spins resources down, and the first request after a quiet period eats the warm-up cost. HYDRA, with the model held resident, has no equivalent — its variance over a month of hourly probes is ±1% CV, the cloud's runs from ±18 tok/s sample-to-sample to a stable mean of ~54 when warm.

The headline number on the home page right now is "HYDRA is currently 118% faster than Ollama Cloud." That's the same comparison expressed from the cloud's side. Both numbers are correct; we lead with 118 because it's the gap relative to what most readers are paying for.

Where HYDRA stops being the right answer

Honest section. HYDRA is great at ≤80B models that fit comfortably in 96 GB VRAM with room for KV cache and CUDA graph workspace at 32K context. It is the wrong build for two cases:

Bursty intermittent workloads. If your traffic is one request every few hours, you're paying ~170 W idle to keep the model warm. Renting cloud GPU time, or hitting a managed endpoint, is cheaper at low duty cycle. The crossover is somewhere around 10–15% utilization depending on power cost.
Models that need more VRAM than you have. gpt-oss:120b is 60.9 GB on disk in MXFP4 and runs cleanly on HYDRA at 113 tok/s under Ollama (full GPU-resident). Try the same model on a 64 GB consumer Blackwell build (we did, separately, on our SCOUT rig) and ~7.7 GB spills to system RAM, throughput collapses to ~29 tok/s. The lesson runs both ways: build for the model class you actually intend to serve, not the one one tier up.

What we're working on next

TTFT measurement. Current probe doesn't stream, so first-token-latency isn't recorded. It's the next piece of the dashboard.
Concurrency sweeps. Single-stream is the right number for the dashboard's editorial purpose ("how fast does this feel"), but real serving cares about throughput at batch sizes 4, 8, 16. GuideLLM-style methodology, separate page.
Multi-region probing. All current probes originate from one LAN. If you're hitting Ollama Cloud from Singapore, your numbers will not look like ours.

Numbers in this post are pulled from the live database; the headline 76.9 will drift as more hourly probes land. The dashboard is the source of truth — this post is a snapshot of what the data has been saying for the past month.