DeepSeek-R1 on an RTX 4090: which size actually fits?
Published 2026-07-02
DeepSeek-R1 is a reasoning model: it generates a long stream of thinking tokens before it answers, so decode speed matters more than usual, and the time spent thinking is set by tok/s. The question people keep asking is which size fits on a 24 GB RTX 4090. We measured it, and the results are signed and reproducible: every number links to its run.
Decode speed by size on a 4090
DeepSeek-R1 8B decodes at 134 tok/s and the 14B at 81 tok/s, both at Q4 with room to spare in 24 GB. The 32B is a different story: it does not fit in 24 GB at Q4 once context is added, so it offloads layers to system RAM and crawls along at under 4 tok/s. That last figure is a fit result, not a fair decode number: the 32B simply needs a bigger card.
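To make the fit question concrete, here is a minimal back-of-envelope sketch of the weight footprint at Q4. It is not how the linked runs were measured. The 4.5 bits per weight, the nominal parameter counts, and the helper names `weights_gb` and `headroom_gb` are all assumptions for illustration; real Q4 formats differ in their effective bits per weight, and KV cache size depends on the model architecture and context length.

```python
# Back-of-envelope VRAM estimate. Every constant here is an illustrative
# assumption, not a measurement from the linked runs.

VRAM_GB = 24.0          # RTX 4090
BITS_PER_WEIGHT = 4.5   # assumed effective size of a Q4 quant, scales included


def weights_gb(params_billions: float, bits_per_weight: float = BITS_PER_WEIGHT) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def headroom_gb(params_billions: float, vram_gb: float = VRAM_GB) -> float:
    """VRAM left over for KV cache, activations and runtime buffers."""
    return vram_gb - weights_gb(params_billions)


if __name__ == "__main__":
    for size in (8, 14, 32):  # nominal parameter counts, in billions
        print(f"{size}B: weights ~{weights_gb(size):.1f} GB, "
              f"headroom ~{headroom_gb(size):.1f} GB")
```

Under these assumptions the 32B weights alone take about 18 GB, leaving roughly 6 GB for everything else, which is in line with the model spilling out of VRAM once context is added. The 8B and 14B leave well over 15 GB free.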
Why size matters more for a reasoning model
A reasoning model emits hundreds to thousands of hidden thinking tokens before the visible answer, so a slow decode rate compounds into a long wait. At 81 tok/s the 14B streams a full chain of thought in seconds; at under 4 tok/s the offloaded 32B would take minutes. Decode is memory-bandwidth-bound, so what counts is whether the weights fit in VRAM: once layers spill to system RAM, every token waits on a much slower memory path. You can check any model and GPU pair with the VRAM-fit checker.
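The wait is simple arithmetic: thinking tokens divided by decode rate. The sketch below uses the decode rates quoted in this article and a hypothetical 2,000-token chain of thought. The token count and the helper name `thinking_wait_s` are assumptions, and because the 32B runs at under 4 tok/s, using 4.0 gives a lower bound on its wait.

```python
# Time spent waiting on hidden thinking tokens before the answer appears.
# THINKING_TOKENS is a hypothetical figure chosen for illustration.

THINKING_TOKENS = 2000

# Decode rates (tok/s) quoted in this article. The 32B figure is "under 4",
# so 4.0 is an optimistic ceiling on speed and a floor on the wait.
RATES = {"8B": 134.0, "14B": 81.0, "32B (offloaded)": 4.0}


def thinking_wait_s(thinking_tokens: int, tok_per_s: float) -> float:
    """Seconds needed to decode the given number of tokens at a fixed rate."""
    return thinking_tokens / tok_per_s


if __name__ == "__main__":
    for name, rate in RATES.items():
        wait = thinking_wait_s(THINKING_TOKENS, rate)
        print(f"{name}: {wait:.0f} s ({wait / 60:.1f} min)")
```

At 2,000 thinking tokens the 14B finishes in under 25 seconds, while the offloaded 32B needs at least 500 seconds, more than eight minutes, before the first visible word.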
Which should you run?
On a 4090, the 14B is the sweet spot: it fits comfortably, keeps the stronger reasoning of the larger distill, and still decodes fast. Pick the 8B if you want maximum speed for lighter tasks. For the 32B, step up to a 32 GB card like the RTX 5090 or a high-memory Apple Silicon Mac. Full standings are on the cheatsheet, and the 3090 vs 4090 comparison covers the card itself.
Reproduce it, or add your rig
The same one-line install measures any size:
$ pipx install llm-speed && llm-speed bench
Numbers as of July 2026; the linked runs and the cheatsheet always reflect current data.