How fast is Gemma 3 on an RTX 4090? (4B, 12B, 27B)

Published 2026-07-03

Gemma 3 is Google's open model family, and its 4B, 12B, and 27B sizes all fit on a 24 GB RTX 4090 at Q4, so the only question is how fast each one runs. We measured all three; the runs are signed and reproducible, and every number links to its run.

Decode speed by size on a 4090

Gemma 3 4B decodes at 195 tok/s, 12B at 93 tok/s, and 27B at 47 tok/s, all at Q4. Unlike a large mixture-of-experts model, these are dense models: every token reads every weight, so decode speed falls roughly in step with size. Every size fits in 24 GB with headroom, so there is no memory cliff, just a clean speed-versus-quality choice.
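A rough back-of-the-envelope footprint shows why none of the three sizes hits a memory cliff. The sketch below is illustrative only: it assumes about 4.5 bits per weight for Q4 (the real figure depends on the quantization variant) and uses the nominal parameter counts from the model names. It ignores KV cache, activations, and runtime overhead, so the headroom it prints is an upper bound, not a measured value.

```python
# Rough Q4 weight footprint for each Gemma 3 size on a 24 GB card.
# ASSUMPTIONS (not taken from the measured runs): ~4.5 bits per weight
# for Q4, nominal parameter counts from the model names, and no
# allowance for KV cache, activations, or runtime overhead.

VRAM_GB = 24.0
BITS_PER_WEIGHT = 4.5  # assumed; varies by Q4 variant

NOMINAL_PARAMS_B = {"4B": 4, "12B": 12, "27B": 27}  # billions of weights


def weight_footprint_gb(params_billion: float, bits: float = BITS_PER_WEIGHT) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits / 8


if __name__ == "__main__":
    for name, params in NOMINAL_PARAMS_B.items():
        size = weight_footprint_gb(params)
        print(f"{name}: ~{size:.1f} GB of weights, ~{VRAM_GB - size:.1f} GB left over")
```

Under these assumptions even the 27B leaves several gigabytes free for context, which is consistent with the "fits with headroom" observation above.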

Which should you run?

The 27B is the strongest and still streams at 47 tok/s, comfortably above reading speed, so it is a fine all-day model on a 4090. The 12B at 93 tok/s is the balanced daily driver, and the 4B at 195 tok/s is for latency-sensitive or high-volume tasks. Confirm the fit for any card with the VRAM-fit checker, and see how the 4090 stacks up against other cards in the 3090 vs 4090 comparison.
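To make "above reading speed" concrete, the sketch below converts the measured decode rates into the time needed to stream one reply and compares them with a human reading rate. The tok/s figures are the ones quoted in this article; the 500-token reply length, the 250 words-per-minute reading speed, and the 0.75 words-per-token ratio are illustrative assumptions, not measurements.

```python
# Time to stream one reply at each measured decode rate, compared with
# an assumed human reading rate.
# ASSUMPTIONS: 500-token reply, 250 words/minute reading speed, and
# ~0.75 English words per token. Only the tok/s values come from the article.

DECODE_TOK_S = {"4B": 195, "12B": 93, "27B": 47}  # from the article

REPLY_TOKENS = 500        # assumed reply length
READING_WPM = 250         # assumed reading speed
WORDS_PER_TOKEN = 0.75    # assumed tokenizer ratio


def stream_seconds(tokens: int, tok_per_s: float) -> float:
    """Seconds needed to stream `tokens` tokens at `tok_per_s`."""
    return tokens / tok_per_s


def reading_tok_per_s(wpm: float = READING_WPM, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Reading speed expressed in tokens per second."""
    return wpm / 60 / words_per_token


if __name__ == "__main__":
    reader = reading_tok_per_s()
    for name, rate in DECODE_TOK_S.items():
        secs = stream_seconds(REPLY_TOKENS, rate)
        print(f"{name}: {secs:.1f} s per reply, {rate / reader:.0f}x reading speed")
```

Changing the assumed reply length or reading speed shifts the absolute numbers but not the ordering of the three sizes.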

What about longer context?

Gemma 3 supports long context. Decode is memory-bandwidth-bound, and the tok/s above hold for short prompts. Very long prompts add prefill time before the first token; once streaming starts, the rate is governed mainly by weight reads, so it changes far less than time-to-first-token does, though a large KV cache adds some memory traffic per token. Full standings across every model and rig are on the cheatsheet, with definitions in the glossary.
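One way to sanity-check the memory-bandwidth-bound picture is to compare each measured rate with a simple ceiling: peak memory bandwidth divided by the bytes of weights read per token. This is a sketch under stated assumptions, not how the benchmark itself works. It uses the RTX 4090's spec-sheet bandwidth of 1008 GB/s, assumes about 4.5 bits per weight for Q4, and ignores KV cache reads and kernel overhead.

```python
# Bandwidth ceiling for decode: if every token must read all weights
# once, tok/s cannot exceed (memory bandwidth) / (weight bytes).
# ASSUMPTIONS: ~4.5 bits per weight for Q4, nominal parameter counts,
# KV cache reads and kernel overhead ignored. The tok/s values are the
# article's measurements; 1008 GB/s is the RTX 4090 spec-sheet figure.

PEAK_BANDWIDTH_GB_S = 1008.0
BITS_PER_WEIGHT = 4.5  # assumed; varies by Q4 variant

# name -> (nominal billions of parameters, measured decode tok/s)
MEASURED = {"4B": (4, 195), "12B": (12, 93), "27B": (27, 47)}


def ceiling_tok_s(params_billion: float) -> float:
    """Upper bound on decode tok/s from weight reads alone."""
    weights_gb = params_billion * BITS_PER_WEIGHT / 8
    return PEAK_BANDWIDTH_GB_S / weights_gb


if __name__ == "__main__":
    for name, (params, measured) in MEASURED.items():
        ceiling = ceiling_tok_s(params)
        print(f"{name}: measured {measured} tok/s, ceiling ~{ceiling:.0f} tok/s "
              f"({measured / ceiling:.0%} of ceiling)")
```

Under these assumptions all three measured rates land below the ceiling, which is consistent with bandwidth being the limit rather than proof of it; the linked runs remain the authoritative numbers.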

Reproduce it, or add your rig

The same one-line install measures any size:

$ pipx install llm-speed && llm-speed bench

Numbers as of July 2026; the linked runs and the cheatsheet always reflect current data.