The RTX 3090 in 2026: still the value pick for local LLMs?
Published 2026-07-02
The RTX 3090 is the most-argued-about GPU in local-LLM communities — a used 24 GB card for a few hundred dollars, and endless claims about how fast it runs models. Almost none of those numbers are verifiable. So we rented one and measured it under our signed, reproducible suite; every figure below links to the exact run.
What a 3090 actually does
On an RTX 3090 (24 GB) running Q4 quantizations via Ollama, DeepSeek-Coder-V2-16B (a mixture-of-experts model) decodes at 189 tok/s, Qwen2.5-Coder-7B at 139 tok/s, Llama-3.1-8B at 136 tok/s, and the dense Qwen2.5-Coder-14B at 69 tok/s. Full, current standings are on /hw/rtx-3090 and the cheatsheet.
The MoE surprise
The fastest coder on the card is also one of the largest. DeepSeek-Coder-V2 is a 16B mixture-of-experts with only ~2.4B parameters active per token, so it decodes at 189 tok/s, about 2.7x the dense Qwen2.5-Coder-14B at 69 tok/s on the very same 3090. Decoding on this card is memory-bandwidth-bound: each generated token requires reading the active weights from VRAM, so fewer active parameters means less data moved per token. That makes a small-active-parameter MoE the sweet spot: big-model quality at small-model speed.
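To make the bandwidth argument concrete, here is a back-of-envelope sketch in Python. It is not part of our benchmark suite. It assumes the 3090's spec-sheet peak bandwidth of about 936 GB/s and an average of about 4.5 bits per weight for a Q4-style quantization; both figures are assumptions for illustration, and the function name is ours. The result is an optimistic ceiling, not a prediction.

```python
# Rough decode-speed ceiling for a memory-bandwidth-bound GPU.
# Assumptions (not taken from the measured runs): ~936 GB/s peak bandwidth
# for the RTX 3090 and ~4.5 bits per weight for a Q4-style quantization.

BANDWIDTH_GB_S = 936.0   # assumed spec-sheet peak memory bandwidth, GB/s
BITS_PER_WEIGHT = 4.5    # assumed average for Q4 quantization


def decode_ceiling_tok_s(active_params_billion: float) -> float:
    """Upper bound on tok/s if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * BITS_PER_WEIGHT / 8
    return BANDWIDTH_GB_S / gb_read_per_token


# name -> (active parameters in billions, measured tok/s from this article)
models = {
    "DeepSeek-Coder-V2-16B (MoE, ~2.4B active)": (2.4, 189),
    "Qwen2.5-Coder-14B (dense, nominal 14B)": (14.0, 69),
}

for name, (active_b, measured) in models.items():
    ceiling = decode_ceiling_tok_s(active_b)
    print(f"{name}: ceiling ~{ceiling:.0f} tok/s, measured {measured} tok/s")

print(f"Measured MoE/dense speedup: {189 / 69:.2f}x")
```

Under these assumptions the ceiling works out to roughly 119 tok/s for the dense 14B and roughly 693 tok/s for the MoE. Both measured figures sit well below their ceilings, and the MoE falls further short, which suggests costs other than weight reads (attention, expert routing, kernel overhead) matter more once the weights themselves are cheap to stream.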
Is that fast?
The 136–139 tok/s of the 7–8B models is roughly 5x quicker than you can read, and it matches or beats many hosted APIs for an 8B-class model, with zero per-token cost and full privacy. The 24 GB of VRAM comfortably holds a 7–14B coder at Q4 with room for context. The tok/s thresholds are in the glossary.
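The VRAM claim can be sanity-checked with simple arithmetic. The sketch below assumes about 4.5 bits per weight for Q4 (an assumption, as real quantization formats vary) and counts only the weights; KV cache, activations, and runtime overhead come out of whatever is left. The helper name is hypothetical.

```python
# Rough weight footprint at Q4 versus the 3090's 24 GB of VRAM.
# Assumption: ~4.5 bits per weight on average; actual formats differ.

VRAM_GB = 24.0
BITS_PER_WEIGHT = 4.5  # assumed, not measured


def weights_gb(params_billion: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billion * BITS_PER_WEIGHT / 8


for params_b in (7, 8, 14, 16):
    used = weights_gb(params_b)
    left = VRAM_GB - used
    print(f"{params_b}B: ~{used:.1f} GB weights, ~{left:.1f} GB left "
          "for KV cache and runtime overhead")
```

Even the 14B and 16B models leave well over half the card free under this estimate, which is where the room for context comes from.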
The value case
At a used street price of roughly $600–800 — or about $0.22/hr to rent in the cloud — the 3090 remains the price/performance pick for running local models fast. It is not the quickest card we measure (that is the RTX 5090), but for the money it is hard to beat.
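For a rough feel of what those prices mean per token, the sketch below converts the rental rate and the measured decode speeds into a cost per million generated tokens, and estimates how many rental hours equal the used purchase price. It assumes the card decodes continuously at the measured speed and ignores electricity, prompt processing, and resale value, so treat the output as an illustration rather than a budget. The function names are ours.

```python
# Back-of-envelope economics from the figures in this article.
# Assumes continuous decoding at the measured speed; ignores power,
# prompt processing, idle time, and resale value.


def rental_cost_per_million_tokens(usd_per_hour: float, tok_per_s: float) -> float:
    """USD to generate one million tokens on a rented card."""
    tokens_per_hour = tok_per_s * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


def break_even_hours(purchase_usd: float, usd_per_hour: float) -> float:
    """Rental hours whose total cost equals the purchase price."""
    return purchase_usd / usd_per_hour


RENT_USD_PER_HOUR = 0.22  # rental rate quoted above

for name, tok_s in [("Llama-3.1-8B", 136), ("DeepSeek-Coder-V2-16B", 189)]:
    cost = rental_cost_per_million_tokens(RENT_USD_PER_HOUR, tok_s)
    print(f"{name}: ~${cost:.2f} per million tokens when renting")

for price in (600, 800):
    hours = break_even_hours(price, RENT_USD_PER_HOUR)
    print(f"${price} used card ~= {hours:.0f} rental hours")
```

If you expect to run the card for fewer hours than the break-even figure, renting is likely the cheaper route; beyond it, buying used starts to pay off.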
Reproduce it, or add your rig
Every number here is one command from reproduction:
$ pipx install llm-speed && llm-speed bench
Numbers as of July 2026; the linked runs and the cheatsheet always reflect current data. Method: /methodology.