What's the fastest GPU for running Llama 3.1 8B locally?

Published 2026-07-02

Llama 3.1 8B is one of the most-run local models: small enough for almost any GPU, capable enough for everyday chat and coding. So which hardware runs it fastest? We measured its decode speed on each of the rigs below. Every result is signed and reproducible, and every number links to the run that produced it.

Llama 3.1 8B, ranked by decode tok/s

RTX 5090: 232 tok/s.
RTX 4090: 154 tok/s.
RTX 3090: 136 tok/s.
Apple M3 Pro: 29 tok/s.

The RTX 5090 is fastest by a wide margin: 232 tok/s, about 1.5x the RTX 4090 and roughly 8x reading speed. Full standings across every rig are on the cheatsheet.
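To make the gaps concrete, here is a minimal sketch that turns the ranking above into relative speeds and wait times. The tok/s figures are copied from the list; the 500-token reply length is an assumption chosen only for illustration, and the estimate covers decode only, ignoring prompt processing.

```python
# Decode speeds copied from the ranking above, in tokens per second.
DECODE_TOK_S = {
    "RTX 5090": 232,
    "RTX 4090": 154,
    "RTX 3090": 136,
    "Apple M3 Pro": 29,
}

# Assumed reply length, for illustration only (not from the benchmark).
RESPONSE_TOKENS = 500


def fastest_over_each(speeds):
    """Return {name: fastest_speed / speed} for every entry."""
    fastest = max(speeds.values())
    return {name: fastest / s for name, s in speeds.items()}


def decode_seconds(tokens, tok_s):
    """Seconds spent decoding `tokens` tokens at `tok_s` tokens per second."""
    return tokens / tok_s


ratios = fastest_over_each(DECODE_TOK_S)
for name, tok_s in DECODE_TOK_S.items():
    secs = decode_seconds(RESPONSE_TOKENS, tok_s)
    print(f"{name}: fastest is {ratios[name]:.2f}x this rig, "
          f"{secs:.1f} s to decode {RESPONSE_TOKENS} tokens")
```

On these numbers the RTX 5090 is about 1.5x the RTX 4090, 1.7x the RTX 3090, and 8x the M3 Pro.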

Which should you pick?

All of these run Llama 3.1 8B comfortably at Q4 (the weights need only about 5 GB). For maximum speed, pick the RTX 5090. For the best value, pick the RTX 3090: at 136 tok/s it already beats most hosted APIs for a fraction of a 5090's price. The RTX 4090 sits in between. A laptop-class Apple chip (M3 Pro) runs it too, just far slower at 29 tok/s; a desktop-class M-series (Max or Ultra) with more GPU cores is considerably quicker than the Pro shown here.
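The "about 5 GB" figure follows from simple arithmetic, sketched below. Both inputs are assumptions rather than measurements: the parameter count is rounded to 8 billion from the model name, and 4.8 bits per weight is an assumed average for a 4-bit quantization (the exact value depends on the Q4 variant, which this post does not specify). The result covers weights only, not the KV cache or runtime overhead.

```python
# Rough weight-size estimate for a quantized model.
N_PARAMS = 8.0e9          # assumed: "8B" rounded from the model name
Q4_BITS_PER_WEIGHT = 4.8  # assumed average; varies by Q4 variant
FP16_BITS_PER_WEIGHT = 16 # unquantized half precision, for comparison


def weight_gb(n_params, bits_per_weight):
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


print(f"Q4:   {weight_gb(N_PARAMS, Q4_BITS_PER_WEIGHT):.1f} GB")
print(f"FP16: {weight_gb(N_PARAMS, FP16_BITS_PER_WEIGHT):.1f} GB")
```

Under these assumptions the Q4 weights come to a little under 5 GB, against 16 GB at FP16.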

Reproduce it

Every number can be reproduced with one command:

$ pipx install llm-speed && llm-speed bench

Numbers as of July 2026; the linked runs and the cheatsheet always reflect current data.