
RTX 5090 — frequently asked questions

Direct answers to the questions local-LLM enthusiasts ask about RTX 5090, drawn from 9 signed runs on llm-speed. Every numerical claim links to a verifiable run permalink at /r/<id>.

What's the fastest LLM on RTX 5090?

The fastest LLM measured on RTX 5090 via llm-speed is Qwen3.6-27B-Q4_K_M.gguf at 69.9 decode tok/s on llama.cpp (workload chat-short, run r_bqsunbd6xa8). Cite as https://llm-speed.com/r/r_bqsunbd6xa8.

This is the headline decode tokens-per-second figure across every (model, backend) pairing submitted for RTX 5090; faster results may exist in configurations not yet benchmarked, but among signed runs this is the published top.

Run r_bqsunbd6xa8 · RTX 5090 leaderboard

Can I run a 7B-class model on RTX 5090?

No 7B-class model run on RTX 5090 has been submitted to llm-speed yet, so the canonical "yes/no with a measured tok/s" answer is not currently published.

RTX 5090 has 32 GB of VRAM, which should fit a 7B-class model at 4-bit quantization with room to spare; submit a run with "llm-speed bench --models <hf-id>", as sketched below, to populate this answer.
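
A minimal sketch of that submission, assuming the llm-speed CLI is installed; <hf-id> stays a placeholder for a real Hugging Face GGUF repo id:

    # benchmark a 7B-class model on this machine; <hf-id> is a placeholder
    llm-speed bench --models <hf-id>

A successful run produces a signed result with its own /r/<id> permalink.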

RTX 5090 leaderboard

Can I run a 30B-class model on RTX 5090?

Yes: llm-speed has measured Qwen3.6-27B-Q4_K_M.gguf at 69.9 decode tok/s on llama.cpp (workload chat-short, run r_bqsunbd6xa8) on RTX 5090, confirming that models in the 27-33B parameter range (including MoE 30B-A3B variants) fit and run end-to-end on this hardware. Cite as https://llm-speed.com/r/r_bqsunbd6xa8.

The exact quantization scheme, backend version, and model digest for each 30B-class result on RTX 5090 are published on the run page; check the run before reproducing.

Run r_bqsunbd6xa8 · Every model on RTX 5090

Can I run a 70B-class model on RTX 5090?

No, not on a single card: RTX 5090 has 32 GB of VRAM, while a 70B-class model (70-72B parameters) needs roughly 38 GB for weights alone at 4-bit quantization, plus KV cache and runtime buffers on top (in practice a 48 GB card), so it won't fit end-to-end on RTX 5090 in a single-GPU configuration.
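
The weight-only arithmetic behind that, as a rough sketch (about 4.5 effective bits per weight approximates a Q4_K_M-style quant; that figure is an assumption, not a measured value):

    # back-of-envelope weight footprint for a 70B model at ~4.5 bits/weight
    python3 -c 'print(round(70e9 * 4.5 / 8 / 1e9, 1), "GB")'  # prints 39.4 GB, already past 32 GB before KV cache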

For workloads that do fit, see the RTX 5090 leaderboard for the largest model successfully measured on this hardware.

RTX 5090 leaderboard

How does RTX 5090 compare to M3 Ultra for local LLM inference?

For a head-to-head between RTX 5090 and M3 Ultra, see the side-by-side comparison page at https://llm-speed.com/vs/rtx-5090-vs-m3-ultra, which lays out every (model, backend) pair where both rigs have a signed run.

As a single-rig anchor, RTX 5090 tops out at 69.9 decode tok/s on Qwen3.6-27B-Q4_K_M.gguf via llama.cpp (run r_bqsunbd6xa8); the M3 Ultra's top number lives on its own /hw/<slug> page, so the comparison stays grounded in measured numbers rather than extrapolation.

RTX 5090 vs M3 Ultra (side-by-side) · M3 Ultra leaderboard · RTX 5090 leaderboard

Which backend is fastest on RTX 5090?

The only backend with a signed local run on RTX 5090 so far is llama.cpp, with a top result of 69.9 decode tok/s on Qwen3.6-27B-Q4_K_M.gguf (run r_bqsunbd6xa8); a multi-backend comparison on RTX 5090 is not yet published.

Submit a competing run with "llm-speed bench --backends <other-backend>" on an RTX 5090 machine, as sketched below, to populate the comparison.
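
A minimal sketch of that invocation, assuming the two flags compose in a single command (the page shows them separately); both values stay placeholders:

    # re-run a measured model under a different backend; both values are placeholders
    llm-speed bench --backends <other-backend> --models <hf-id>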

Run r_bqsunbd6xa8 · RTX 5090 leaderboard

How much VRAM do I need for local LLM inference on RTX 5090?

RTX 5090 ships with 32 GB of VRAM, which sets the ceiling on what fits: a 7B-class model at 4-bit needs roughly 6 GB, a 30B-class model roughly 18 GB, and a 70B-class model roughly 38 GB.

Long-context workloads add KV-cache pressure on top of the weights — a 32k-token context on a 70B-class model adds another 8-16 GB depending on attention layout, so the practical fit is tighter than the weight-only number.
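
As a rough illustration of that KV-cache arithmetic, with a hypothetical 70B-class layout (80 layers, 8 grouped-query KV heads, head dimension 128, fp16 cache; all assumed values, not taken from a signed run):

    # KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim x context x 2 bytes (fp16)
    python3 -c 'print(round(2 * 80 * 8 * 128 * 32768 * 2 / 1e9, 1), "GB")'  # prints 10.7 GB at 32k context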

For exact "did it actually run" evidence on RTX 5090, the leaderboard at /hw/rtx-5090 lists every model that has produced a signed result on this hardware.

Models successfully run on RTX 5090

Is RTX 5090 worth it for local LLM inference in 2026?

On signed data, RTX 5090 delivers up to 69.9 decode tok/s on Qwen3.6-27B-Q4_K_M.gguf via llama.cpp (run r_bqsunbd6xa8), which puts it above conversational-reading speed (~20 tok/s) but below interactive-coding thresholds for the published top configuration.

"Worth it" depends on your model class: RTX 5090 is most useful for 7B-30B class local models with comfortable headroom.

For "what fits and how fast", the per-model rows on /hw/rtx-5090 are the honest answer; for cross-rig comparisons, see /vs/rtx-5090-vs-m3-ultra.

RTX 5090 leaderboard · RTX 5090 vs M3 Ultra