Decode vs prefill tok/s: what LLM speed numbers actually mean
Published 2026-07-02
Every LLM speed benchmark is really two numbers, and conflating the two is the most common mistake in reading one. A model has a prefill speed and a decode speed, and they can differ by more than 20x.
Prefill vs decode
Prefill is the model reading your prompt: it processes every input token in parallel, so the hardware stays busy with compute and throughput is high. Decode is the model writing its answer: it generates one token at a time, each depending on the previous one, so every step has to re-read the model weights from memory to produce a single token. That makes decode memory-bandwidth-bound and much slower. Prefill sets your time-to-first-token (TTFT); decode sets how fast the answer streams.
The gap, measured
On an RTX 3090, Qwen2.5-Coder-7B prefills at ~3,300 tok/s but decodes at only ~135 tok/s — prefill is roughly 25x faster. The pattern holds everywhere we measure: prefill in the thousands, decode in the tens-to-hundreds. Both columns are on the cheatsheet.
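To make the gap concrete, here is a minimal sketch that converts the two rounded throughput figures quoted above into per-token latency and a speed ratio. The function names are illustrative and not part of llm-speed.

```python
def ms_per_token(tok_per_s: float) -> float:
    """Convert throughput (tokens/second) to latency per token in milliseconds."""
    if tok_per_s <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 / tok_per_s


def speed_ratio(prefill_tok_s: float, decode_tok_s: float) -> float:
    """How many times faster prefill is than decode."""
    return prefill_tok_s / decode_tok_s


# Rounded figures from the RTX 3090 / Qwen2.5-Coder-7B example above.
prefill_tok_s, decode_tok_s = 3300.0, 135.0

print(f"prefill: {ms_per_token(prefill_tok_s):.2f} ms/token")  # prefill: 0.30 ms/token
print(f"decode:  {ms_per_token(decode_tok_s):.2f} ms/token")   # decode:  7.41 ms/token
print(f"ratio:   {speed_ratio(prefill_tok_s, decode_tok_s):.1f}x")  # ratio:   24.4x
```

Because the inputs are rounded, the ratio is only good to about one significant figure, which is why it reads as "roughly 25x".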
Which number matters for you
For interactive chat and coding, decode is what you feel: it is the stream speed. For long prompts, large codebases, or RAG, prefill (and therefore TTFT) dominates: it is the wait before the first token. A rough rule: TTFT ≈ (prompt tokens) ÷ (prefill tok/s), so a 10k-token prompt at ~3,300 tok/s takes about 3 seconds before the answer starts. Definitions and thresholds are in the glossary.
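The rule of thumb extends naturally to a whole response: wait for prefill, then stream at decode speed. The sketch below is a back-of-the-envelope estimate under stated assumptions: throughput is treated as constant regardless of prompt length, and queueing, network, and tokenization overhead are ignored. `estimate_latency` and the 500-token answer length are hypothetical, chosen only for illustration.

```python
from typing import NamedTuple


class Latency(NamedTuple):
    ttft_s: float    # time to first token (prefill phase)
    stream_s: float  # time spent streaming the answer (decode phase)
    total_s: float   # ttft_s + stream_s


def estimate_latency(prompt_tokens: int, output_tokens: int,
                     prefill_tok_s: float, decode_tok_s: float) -> Latency:
    """Rough estimate: TTFT = prompt / prefill speed, stream = output / decode speed."""
    ttft = prompt_tokens / prefill_tok_s
    stream = output_tokens / decode_tok_s
    return Latency(ttft, stream, ttft + stream)


# 10k-token prompt from the text; the 500-token answer is an assumed example.
est = estimate_latency(10_000, 500, prefill_tok_s=3300.0, decode_tok_s=135.0)
print(f"TTFT:   {est.ttft_s:.2f} s")    # TTFT:   3.03 s
print(f"stream: {est.stream_s:.2f} s")  # stream: 3.70 s
print(f"total:  {est.total_s:.2f} s")   # total:  6.73 s
```

With a long prompt and a short answer, the two phases cost about the same here, which is why prefill cannot be ignored for RAG-style workloads.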
See it yourself
Every run we publish reports both numbers, signed and reproducible:
$ pipx install llm-speed && llm-speed bench
Numbers as of July 2026; the linked run and the cheatsheet always reflect current data.