Best GPU for local LLMs in 2026

RTX 5090, 4090, used 3090, 5070 Ti and the rest — ranked on real decode tok/s across the models people actually run locally. No affiliate picks, just submitted runs.

Verdict

As of today, the fastest submitted GPU result is RTX 5090 (32GB) at 318.4tok/s on gpt-oss-20b. Match the card to the model: a 24-32 GB card holds a 7-32B coder at 4-bit and decodes it fastest, while 70B-class weights need ~48 GB (dual-card) or Apple unified memory. Every row links a signed run.

Recommendation

Fastest GPU on the leaderboard: RTX 5090 (32GB) at 318.4tok/s.

The best GPU depends on what you run. For a 7-14B coding model, memory bandwidth sets decode speed and a single 24-32 GB card is plenty; for a 70B dense model, total VRAM (or a two-card split) is the gate. The RTX 5090's 32 GB GDDR7 at ~1.8 TB/s makes it the fastest single consumer card on the leaderboard for any model that fits in 32 GB; a used pair of 3090s (48 GB combined) is the value route to 70B-class weights; AMD's cards trade CUDA's tooling depth for more VRAM per dollar. Below is every submitted GPU run, ranked by decode tok/s, with the model and quant on each row so you can match a card to your actual workload rather than a headline number. If your card + model pairing isn't here yet, run the suite and submit — the row appears next refresh.

Submitted benchmarks

GPU	Best model	decode tok/s	Run
RTX 5090 (32GB)	gpt-oss-20b	318.4tok/s	r_b9ul-vxh9sc

See also: All hardware · All models · Methodology