Hardware FAQ
M3 Ultra — frequently asked questions
Direct answers to the questions local-LLM enthusiasts ask about M3 Ultra, drawn from 22 signed runs on llm-speed. Every numerical claim links to a verifiable run permalink at /r/<id>.
What's the fastest LLM on M3 Ultra?
The fastest measured LLM on M3 Ultra on llm-speed is stable-code-instruct-3b-4bit at 192.5 decode tok/s on mlx 0.31.3 (workload chat-short, run r_y2_5y8oo97d). Cite as https://llm-speed.com/r/r_y2_5y8oo97d.
This is the headline decode tokens-per-second across every (model, backend) pairing submitted on M3 Ultra; faster results may exist in configurations not yet benchmarked on this hardware, but among signed runs this is the published top.
Can I run a 7B-class model on M3 Ultra?
Yes — llm-speed has measured Qwen2.5-7B-Instruct-4bit at 139.6 decode tok/s on mlx 0.31.3 (workload chat-short, run r_5r6rhiynenc) on M3 Ultra, confirming the 7-9B parameter range fits and runs end-to-end on this hardware. Cite as https://llm-speed.com/r/r_5r6rhiynenc.
7B-class models on M3 Ultra typically use 4bit; check the run for the exact backend version and model digest.
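For a sense of what that looks like in practice, a 4-bit 7B model can be loaded and sampled with the mlx-lm Python package. A minimal sketch follows; the Hugging Face repo name and prompt are illustrative assumptions, not the exact artifacts behind run r_5r6rhiynenc:

    # Minimal sketch: load a 4-bit 7B model via mlx-lm and stream one reply.
    # Assumes `pip install mlx-lm`; the repo name below is an assumption for
    # illustration, not the exact model digest recorded in the signed run.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
    reply = generate(
        model,
        tokenizer,
        prompt="Summarize what unified memory means for local LLMs.",
        max_tokens=128,
        verbose=True,  # prints tokens as they decode
    )
    print(reply)

Any speed readout you observe locally is a rough sanity check only; the leaderboard's signed numbers come from the llm-speed harness and its fixed workloads (chat-short in the run above).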
Can I run a 30B-class model on M3 Ultra?
Yes — llm-speed has measured Qwen3-Coder-30B-A3B-Instruct-4bit at 112.2 decode tok/s on mlx 0.31.3 (workload chat-short, run r_fpsca03u2o_) on M3 Ultra, confirming the 27-33B parameter range, including MoE 30B-A3B variants, fits and runs end-to-end on this hardware. Cite as https://llm-speed.com/r/r_fpsca03u2o_.
30B-class models on M3 Ultra typically use the published quant scheme on the run page; check the run for the exact backend version and model digest.
Can I run a 70B-class model on M3 Ultra?
Yes — llm-speed has measured llama-3.3-70b-Instruct-4bit at 16.8 decode tok/s on mlx 0.31.3 (workload chat-short, run r_sx3a4y9n-m4) on M3 Ultra, confirming the 70-72B parameter range fits and runs end-to-end on this hardware. Cite as https://llm-speed.com/r/r_sx3a4y9n-m4.
70B-class models on M3 Ultra typically use the published quant scheme on the run page; check the run for the exact backend version and model digest.
How does M3 Ultra compare to RTX 5090 for local LLM inference?
For a head-to-head between M3 Ultra and RTX 5090, see the side-by-side comparison page at https://llm-speed.com/vs/m3-ultra-vs-rtx-5090, which lays out every (model, backend) pair where both rigs have a signed run.
As a single-rig anchor, M3 Ultra tops out at 192.5 decode tok/s on stable-code-instruct-3b-4bit via mlx (run r_y2_5y8oo97d); the RTX 5090's top number lives on its own /hw/<slug> page, so the comparison stays grounded in measured numbers rather than extrapolation.
Related: M3 Ultra vs RTX 5090 (side-by-side) · RTX 5090 leaderboard · M3 Ultra leaderboard
Which backend is fastest on M3 Ultra?
The only backend with a signed local run on M3 Ultra so far is mlx, with a top result of 192.5 decode tok/s on stable-code-instruct-3b-4bit (run r_y2_5y8oo97d); a multi-backend comparison on M3 Ultra is not yet published.
Submit a competing run with "llm-speed bench --backends <other-backend>" on an M3 Ultra machine to populate the comparison.
How much unified memory do I need for local LLM inference on M3 Ultra?
M3 Ultra ships with up to 512 GB of unified memory, which sets the ceiling on what fits: a 7B-class model at 4-bit needs roughly 6 GB, a 30B-class model at 4-bit needs roughly 18 GB, and a 70B-class model at 4-bit needs roughly 38 GB.
Long-context workloads add KV-cache pressure on top of the weights — a 32k-token context on a 70B-class model adds another 8-16 GB depending on attention layout, so the practical fit is tighter than the weight-only number.
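Those figures follow from simple arithmetic: weight footprint is roughly parameter count × bits-per-weight ÷ 8 plus runtime overhead, and KV cache is roughly 2 × layers × KV heads × head dimension × bytes per element × context length. A minimal sketch of that arithmetic, with the 70B architecture shape assumed for illustration rather than read from any signed run:

    # Back-of-the-envelope memory estimate for weights and KV cache.
    # The layer count, KV-head count, and head dimension below are assumptions
    # for a Llama-3.3-70B-like shape; check the model card for real values.

    def weight_gb(params_billion: float, bits: float, overhead: float = 1.1) -> float:
        """Approximate quantized weight footprint in GB, with ~10% runtime overhead."""
        return params_billion * 1e9 * (bits / 8) * overhead / 1e9

    def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                    context_tokens: int, bytes_per_elem: int = 2) -> float:
        """Approximate KV-cache footprint in GB (keys + values, fp16 by default)."""
        return 2 * layers * kv_heads * head_dim * bytes_per_elem * context_tokens / 1e9

    print(f"70B @ 4-bit weights : ~{weight_gb(70, 4):.0f} GB")                  # ~38 GB
    print(f"70B KV cache @ 32k  : ~{kv_cache_gb(80, 8, 128, 32_768):.0f} GB")   # ~11 GB

Both estimates land inside the ranges quoted above; the real fit also depends on the backend's allocator and any buffers beyond the weights and cache.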
For exact "did it actually run" evidence on M3 Ultra, the leaderboard at /hw/m3-ultra lists every model that has produced a signed result on this hardware.
Is M3 Ultra worth it for local LLM inference in 2026?
On signed data, M3 Ultra delivers up to 192.5 decode tok/s on stable-code-instruct-3b-4bit via mlx (run r_y2_5y8oo97d), which puts it comfortably above interactive-coding thresholds (~80 tok/s) for the published top configuration.
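To convert those decode rates into wall-clock feel, divide the reply length by tok/s; a quick sketch using the rates cited on this page (the 300-token reply length is an assumption, not llm-speed data):

    # Rough time to stream a full reply at a given decode rate.
    # The 300-token reply length is an illustrative assumption.
    def reply_seconds(reply_tokens: int, decode_tok_per_s: float) -> float:
        return reply_tokens / decode_tok_per_s

    for rate in (192.5, 112.2, 16.8):  # measured decode tok/s cited on this page
        print(f"{rate:6.1f} tok/s -> 300-token reply in ~{reply_seconds(300, rate):.1f} s")

At 16.8 tok/s a 300-token reply takes roughly 18 seconds, which is usable for chat but noticeably below the interactive-coding threshold mentioned above.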
"Worth it" depends on your model class: M3 Ultra is most useful for 30B-70B class local models where its memory ceiling matters.
For "what fits and how fast", the per-model rows on /hw/m3-ultra are the honest answer; for cross-rig comparisons, see /vs/m3-ultra-vs-rtx-5090.
What quantization should I use on M3 Ultra?
On M3 Ultra, the only quant scheme with a signed local run is 4bit (for example, 139.6 tok/s on Qwen2.5-7B-Instruct-4bit, run r_5r6rhiynenc); a multi-quant comparison on M3 Ultra is not yet published.
Quant choice is a quality-vs-speed tradeoff that this hardware FAQ does not arbitrate; llm-speed publishes hardware-side speed, not output quality. For quality scores, see the model card on Hugging Face and the LMSYS Chatbot Arena.