Best GPU for local LLMs in 2026
RTX 5090, 4090, used 3090, 5070 Ti and the rest — ranked on real decode tok/s across the models people actually run locally. No affiliate picks, just submitted runs.
As of today, the fastest submitted GPU result is RTX 5090 (32GB) at 318.4tok/s on gpt-oss-20b. Match the card to the model: a 24-32 GB card holds a 7-32B coder at 4-bit and decodes it fastest, while 70B-class weights need ~48 GB (dual-card) or Apple unified memory. Every row links a signed run.
Fastest GPU on the leaderboard: RTX 5090 (32GB) at 318.4tok/s.
The best GPU depends on what you run. For a 7-14B coding model, memory bandwidth sets decode speed and a single 24-32 GB card is plenty; for a 70B dense model, total VRAM (or a two-card split) is the gate. The RTX 5090's 32 GB GDDR7 at ~1.8 TB/s makes it the fastest single consumer card on the leaderboard for any model that fits in 32 GB; a used pair of 3090s (48 GB combined) is the value route to 70B-class weights; AMD's cards trade CUDA's tooling depth for more VRAM per dollar. Below is every submitted GPU run, ranked by decode tok/s, with the model and quant on each row so you can match a card to your actual workload rather than a headline number. If your card + model pairing isn't here yet, run the suite and submit — the row appears next refresh.
Submitted benchmarks
| GPU | Best model | decode tok/s | Run |
|---|---|---|---|
| RTX 5090 (32GB) | gpt-oss-20b | 318.4tok/s | r_b9ul-vxh9sc |
See also: All hardware · All models · Methodology