
RTX 5090 — frequently asked questions

Direct answers to the questions local-LLM enthusiasts ask about RTX 5090, drawn from 9 signed runs on llm-speed. Every numerical claim links to a verifiable run permalink at /r/<id>.

What's the fastest LLM on RTX 5090?

The fastest LLM measured on RTX 5090 via llm-speed is Qwen3.6-27B-Q4_K_M.gguf at 69.9 decode tok/s on llama.cpp (workload chat-short, run r_bqsunbd6xa8). Cite as https://llm-speed.com/r/r_bqsunbd6xa8.

This is the headline decode tokens-per-second figure across every (model, backend) pairing submitted for RTX 5090; faster results may exist in configurations not yet benchmarked, but among signed runs this is the published top.

Run r_bqsunbd6xa8 · RTX 5090 leaderboard

Can I run a 7B-class model on RTX 5090?

No 7B-class model run on RTX 5090 has been submitted to llm-speed yet, so the canonical "yes/no with a measured tok/s" answer is not currently published.

RTX 5090 has 32 GB of VRAM, which should fit a 7B-class model at 4-bit quantization with room to spare; submit a run with "llm-speed bench --models <hf-id>", as sketched below, to populate this answer.
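
A minimal sketch of that submission, assuming the llm-speed CLI is installed; <hf-id> stays a placeholder for a real Hugging Face GGUF repo id:

    # benchmark a 7B-class model on this machine; <hf-id> is a placeholder
    llm-speed bench --models <hf-id>

A successful run produces a signed result with its own /r/<id> permalink.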

RTX 5090 leaderboard

Can I run a 30B-class model on RTX 5090?

Yes: llm-speed has measured Qwen3.6-27B-Q4_K_M.gguf at 69.9 decode tok/s on llama.cpp (workload chat-short, run r_bqsunbd6xa8) on RTX 5090, confirming that models in the 27-33B parameter range (including MoE 30B-A3B variants) fit and run end-to-end on this hardware. Cite as https://llm-speed.com/r/r_bqsunbd6xa8.

The exact quantization scheme, backend version, and model digest for each 30B-class result on RTX 5090 are published on the run page; check the run before reproducing.

Run r_bqsunbd6xa8 · Every model on RTX 5090

Can I run a 70B-class model on RTX 5090?

No, not on a single card: RTX 5090 has 32 GB of VRAM, while a 70B-class model (70-72B parameters) needs roughly 38 GB for weights alone at 4-bit quantization, plus KV cache and runtime buffers on top (in practice a 48 GB card), so it won't fit end-to-end on RTX 5090 in a single-GPU configuration.
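
The weight-only arithmetic behind that, as a rough sketch (about 4.5 effective bits per weight approximates a Q4_K_M-style quant; that figure is an assumption, not a measured value):

    # back-of-envelope weight footprint for a 70B model at ~4.5 bits/weight
    python3 -c 'print(round(70e9 * 4.5 / 8 / 1e9, 1), "GB")'  # prints 39.4 GB, already past 32 GB before KV cache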

For workloads that do fit, see the RTX 5090 leaderboard for the largest model successfully measured on this hardware.

RTX 5090 leaderboard

How does RTX 5090 compare to M3 Ultra for local LLM inference?

For a head-to-head between RTX 5090 and M3 Ultra, see the side-by-side comparison page at https://llm-speed.com/vs/rtx-5090-vs-m3-ultra, which lays out every (model, backend) pair where both rigs have a signed run.

As a single-rig anchor, RTX 5090 tops out at 69.9 decode tok/s on Qwen3.6-27B-Q4_K_M.gguf via llama.cpp (run r_bqsunbd6xa8); the M3 Ultra's top number lives on its own /hw/<slug> page, so the comparison stays grounded in measured numbers rather than extrapolation.

RTX 5090 vs M3 Ultra (side-by-side) · M3 Ultra leaderboard · RTX 5090 leaderboard

Which backend is fastest on RTX 5090?

The only backend with a signed local run on RTX 5090 so far is llama.cpp, with a top result of 69.9 decode tok/s on Qwen3.6-27B-Q4_K_M.gguf (run r_bqsunbd6xa8); a multi-backend comparison on RTX 5090 is not yet published.

Submit a competing run with "llm-speed bench --backends <other-backend>" on an RTX 5090 machine, as sketched below, to populate the comparison.
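
A minimal sketch of that invocation, assuming the two flags compose in a single command (the page shows them separately); both values stay placeholders:

    # re-run a measured model under a different backend; both values are placeholders
    llm-speed bench --backends <other-backend> --models <hf-id>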

Run r_bqsunbd6xa8 · RTX 5090 leaderboard

How much VRAM do I need for local LLM inference on RTX 5090?

RTX 5090 ships with 32 GB of VRAM, which sets the ceiling on what fits: a 7B-class model at 4-bit needs roughly 6 GB, a 30B-class model roughly 18 GB, and a 70B-class model roughly 38 GB.

Long-context workloads add KV-cache pressure on top of the weights — a 32k-token context on a 70B-class model adds another 8-16 GB depending on attention layout, so the practical fit is tighter than the weight-only number.
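
As a rough illustration of that KV-cache arithmetic, with a hypothetical 70B-class layout (80 layers, 8 grouped-query KV heads, head dimension 128, fp16 cache; all assumed values, not taken from a signed run):

    # KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim x context x 2 bytes (fp16)
    python3 -c 'print(round(2 * 80 * 8 * 128 * 32768 * 2 / 1e9, 1), "GB")'  # prints 10.7 GB at 32k context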

For exact "did it actually run" evidence on RTX 5090, the leaderboard at /hw/rtx-5090 lists every model that has produced a signed result on this hardware.

Models successfully run on RTX 5090

Is RTX 5090 worth it for local LLM inference in 2026?

On signed data, RTX 5090 delivers up to 69.9 decode tok/s on Qwen3.6-27B-Q4_K_M.gguf via llama.cpp (run r_bqsunbd6xa8), which puts it above conversational-reading speed (~20 tok/s) but below interactive-coding thresholds for the published top configuration.

"Worth it" depends on your model class: RTX 5090 is most useful for 7B-30B class local models with comfortable headroom.

For "what fits and how fast", the per-model rows on /hw/rtx-5090 are the honest answer; for cross-rig comparisons, see /vs/rtx-5090-vs-m3-ultra.

RTX 5090 leaderboard · RTX 5090 vs M3 Ultra