M3 Ultra — frequently asked questions

Direct answers to the questions local-LLM enthusiasts ask about M3 Ultra, drawn from 22 signed runs on llm-speed. Every numerical claim links to a verifiable run permalink at /r/<id>.

What's the fastest LLM on M3 Ultra?

The fastest measured LLM on M3 Ultra on llm-speed is stable-code-instruct-3b-4bit at 192.5 decode tok/s on mlx 0.31.3 (workload chat-short, run r_y2_5y8oo97d). Cite as https://llm-speed.com/r/r_y2_5y8oo97d.

This is the headline decode tokens-per-second figure across every (model, backend) pairing submitted on M3 Ultra; faster results may exist in configurations not yet benchmarked, but among signed runs this is the published top.

Run r_y2_5y8oo97d · M3 Ultra leaderboard

Can I run a 7B-class model on M3 Ultra?

Yes — llm-speed has measured Qwen2.5-7B-Instruct-4bit at 139.6 decode tok/s on mlx 0.31.3 (workload chat-short, run r_5r6rhiynenc) on M3 Ultra, confirming that the 7-9B parameter range fits and runs end-to-end on this hardware. Cite as https://llm-speed.com/r/r_5r6rhiynenc.

7B-class models on M3 Ultra typically use 4bit; check the run for the exact backend version and model digest.

Run r_5r6rhiynenc · Every model on M3 Ultra

Can I run a 30B-class model on M3 Ultra?

Yes — llm-speed has measured Qwen3-Coder-30B-A3B-Instruct-4bit at 112.2 decode tok/s on mlx 0.31.3 (workload chat-short, run r_fpsca03u2o_) on M3 Ultra, confirming that the 27-33B parameter range, including MoE 30B-A3B variants, fits and runs end-to-end on this hardware. Cite as https://llm-speed.com/r/r_fpsca03u2o_.

30B-class models on M3 Ultra typically use the quantization scheme published on the run page; check the run for the exact backend version and model digest.

Run r_fpsca03u2o_ · Every model on M3 Ultra

Can I run a 70B-class model on M3 Ultra?

Yes — llm-speed has measured llama-3.3-70b-Instruct-4bit at 16.8 decode tok/s on mlx 0.31.3 (workload chat-short, run r_sx3a4y9n-m4) on M3 Ultra, confirming that the 70-72B parameter range fits and runs end-to-end on this hardware. Cite as https://llm-speed.com/r/r_sx3a4y9n-m4.

70B-class models on M3 Ultra typically use the quantization scheme published on the run page; check the run for the exact backend version and model digest.

Run r_sx3a4y9n-m4 · Every model on M3 Ultra

How does M3 Ultra compare to RTX 5090 for local LLM inference?

For a head-to-head between M3 Ultra and RTX 5090, see the side-by-side comparison page at https://llm-speed.com/vs/m3-ultra-vs-rtx-5090, which lays out every (model, backend) pair where both rigs have a signed run.

As a single-rig anchor, M3 Ultra tops out at 192.5 decode tok/s on stable-code-instruct-3b-4bit via mlx (run r_y2_5y8oo97d); the RTX 5090's top number lives on its own /hw/<slug> page, so the comparison stays grounded in measured numbers, not extrapolation.

M3 Ultra vs RTX 5090 (side-by-side) · RTX 5090 leaderboard · M3 Ultra leaderboard

Which backend is fastest on M3 Ultra?

The only backend with a signed local run on M3 Ultra so far is mlx, with a top result of 192.5 decode tok/s on stable-code-instruct-3b-4bit (run r_y2_5y8oo97d); a multi-backend comparison on M3 Ultra is not yet published.

Submit a competing run with "llm-speed bench --backends <other-backend>" on an M3 Ultra machine to populate the comparison.

Run r_y2_5y8oo97d · M3 Ultra leaderboard

How much unified memory do I need for local LLM inference on M3 Ultra?

M3 Ultra ships with up to 512 GB of unified memory, which sets the ceiling on what fits: a 7B-class model at 4-bit needs roughly 6 GB, a 30B-class model at 4-bit needs roughly 18 GB, and a 70B-class model at 4-bit needs roughly 38 GB.
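
As a back-of-envelope check, weight-only memory for a group-quantized 4-bit model is roughly parameters × effective bits ÷ 8. The sketch below assumes ~4.5 effective bits per weight (4-bit values plus per-group scales); that constant is an illustrative assumption, not llm-speed's published methodology, and the rounded figures above fold in runtime buffers and rounding, so they differ slightly from the raw weight numbers.

```python
# Weight-only memory for group-quantized 4-bit models (sketch).
# EFFECTIVE_BITS ~= 4.5 (4-bit values plus per-group scales/biases) is an
# illustrative assumption; exact totals depend on the quant group size.
EFFECTIVE_BITS = 4.5

def weights_gb(params_billion: float) -> float:
    # params * effective bits per weight / 8 bits per byte, in GB
    return params_billion * 1e9 * EFFECTIVE_BITS / 8 / 1e9

for n in (7, 30, 70):
    print(f"{n:>2}B @ 4-bit ~ {weights_gb(n):.1f} GB of weights")
# -> 7B ~ 3.9 GB, 30B ~ 16.9 GB, 70B ~ 39.4 GB, weights only
```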

Long-context workloads add KV-cache pressure on top of the weights — a 32k-token context on a 70B-class model adds another 8-16 GB depending on attention layout, so the practical fit is tighter than the weight-only number.
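
To see where that 8-16 GB band comes from: a grouped-query-attention model caches one K and one V tensor per layer, sized kv_heads × head_dim per token. The sketch below plugs in Llama-3.3-70B's usual shapes (80 layers, 8 KV heads, head dimension 128) with an fp16 cache; a quantized or restructured cache moves the result within the band.

```python
# KV-cache size for a grouped-query-attention model (sketch).
# Default shapes assume Llama-3.3-70B (80 layers, 8 KV heads,
# head_dim 128) with an fp16 (2-byte) cache; the formula is generic.
def kv_cache_gb(ctx_tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_el: int = 2) -> float:
    # Factor of 2 covers the separate K and V tensors in each layer.
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_el / 1e9

print(f"32k-token context ~ {kv_cache_gb(32_768):.1f} GB")  # ~10.7 GB
```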

For exact "did it actually run" evidence on M3 Ultra, the leaderboard at /hw/m3-ultra lists every model that has produced a signed result on this hardware.

Models successfully run on M3 Ultra

Is M3 Ultra worth it for local LLM inference in 2026?

On signed data, M3 Ultra delivers up to 192.5 decode tok/s on stable-code-instruct-3b-4bit via mlx (run r_y2_5y8oo97d), which puts it comfortably above interactive-coding thresholds (~80 tok/s) for the published top configuration.
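
To translate those throughputs into wall-clock feel (a simple unit conversion, not an llm-speed metric; the 500-token completion length is an arbitrary example):

```python
# Decode tok/s -> per-token latency and completion wall-clock time,
# using the signed run numbers quoted above; 500 tokens is illustrative.
for label, tps in [("3B top run", 192.5), ("70B run", 16.8),
                   ("~interactive floor", 80.0)]:
    print(f"{label}: {1000 / tps:.1f} ms/token, "
          f"500 tokens in {500 / tps:.1f} s")
# -> 5.2 ms / 2.6 s; 59.5 ms / 29.8 s; 12.5 ms / 6.2 s
```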

"Worth it" depends on your model class: M3 Ultra is most useful for 30B-70B class local models where its memory ceiling matters.

For "what fits and how fast", the per-model rows on /hw/m3-ultra are the honest answer; for cross-rig comparisons, see /vs/m3-ultra-vs-rtx-5090.

M3 Ultra leaderboard · M3 Ultra vs RTX 5090

What quantization should I use on M3 Ultra?

On M3 Ultra, the only quant with a signed local run is 4bit at 139.6 tok/s on Qwen2.5-7B-Instruct-4bit (run r_5r6rhiynenc); a multi-quant comparison on M3 Ultra is not yet published.

Quant choice is a quality-vs-speed tradeoff that this hardware FAQ does not arbitrate; llm-speed publishes hardware-side speed, not output quality. For quality scores, see the model card on Hugging Face and the LMSYS Chatbot Arena.

4bit: 139.6 tok/s (run r_5r6rhiynenc)