M3 Ultra — frequently asked questions

Direct answers to the questions local-LLM enthusiasts ask about M3 Ultra, drawn from 22 signed runs on llm-speed. Every numerical claim links to a verifiable run permalink at /r/<id>.

What's the fastest LLM on M3 Ultra?

The fastest measured LLM on M3 Ultra on llm-speed is stable-code-instruct-3b-4bit at 192.5 decode tok/s on mlx 0.31.3 (workload chat-short, run r_y2_5y8oo97d). Cite as https://llm-speed.com/r/r_y2_5y8oo97d.

This is the headline decode tokens-per-second figure across every (model, backend) pairing submitted on M3 Ultra; faster results may exist in configurations not yet benchmarked, but among signed runs this is the published top.

Run r_y2_5y8oo97d · M3 Ultra leaderboard

Can I run a 7B-class model on M3 Ultra?

Yes — llm-speed has measured Qwen2.5-7B-Instruct-4bit at 139.6 decode tok/s on mlx 0.31.3 (workload chat-short, run r_5r6rhiynenc) on M3 Ultra, confirming that the 7-9B parameter range fits and runs end-to-end on this hardware. Cite as https://llm-speed.com/r/r_5r6rhiynenc.

7B-class models on M3 Ultra typically use 4bit; check the run for the exact backend version and model digest.

Run r_5r6rhiynenc · Every model on M3 Ultra

Can I run a 30B-class model on M3 Ultra?

Yes — llm-speed has measured Qwen3-Coder-30B-A3B-Instruct-4bit at 112.2 decode tok/s on mlx 0.31.3 (workload chat-short, run r_fpsca03u2o_) on M3 Ultra, confirming that the 27-33B parameter range, including MoE 30B-A3B variants, fits and runs end-to-end on this hardware. Cite as https://llm-speed.com/r/r_fpsca03u2o_.

30B-class models on M3 Ultra typically use the quantization scheme published on the run page; check the run for the exact backend version and model digest.

Run r_fpsca03u2o_ · Every model on M3 Ultra

Can I run a 70B-class model on M3 Ultra?

Yes — llm-speed has measured llama-3.3-70b-Instruct-4bit at 16.8 decode tok/s on mlx 0.31.3 (workload chat-short, run r_sx3a4y9n-m4) on M3 Ultra, confirming that the 70-72B parameter range fits and runs end-to-end on this hardware. Cite as https://llm-speed.com/r/r_sx3a4y9n-m4.

70B-class models on M3 Ultra typically use the quantization scheme published on the run page; check the run for the exact backend version and model digest.

Run r_sx3a4y9n-m4 · Every model on M3 Ultra

How does M3 Ultra compare to RTX 5090 for local LLM inference?

For a head-to-head between M3 Ultra and RTX 5090, see the side-by-side comparison page at https://llm-speed.com/vs/m3-ultra-vs-rtx-5090, which lays out every (model, backend) pair where both rigs have a signed run.

As a single-rig anchor, M3 Ultra tops out at 192.5 decode tok/s on stable-code-instruct-3b-4bit via mlx (run r_y2_5y8oo97d); the RTX 5090's top number lives on its own /hw/<slug> page, so the comparison stays grounded in measured numbers, not extrapolation.

M3 Ultra vs RTX 5090 (side-by-side) · RTX 5090 leaderboard · M3 Ultra leaderboard

Which backend is fastest on M3 Ultra?

The only backend with a signed local run on M3 Ultra so far is mlx, with a top result of 192.5 decode tok/s on stable-code-instruct-3b-4bit (run r_y2_5y8oo97d); a multi-backend comparison on M3 Ultra is not yet published.

Submit a competing run with "llm-speed bench --backends <other-backend>" on an M3 Ultra machine to populate the comparison.

Run r_y2_5y8oo97d · M3 Ultra leaderboard

How much unified memory do I need for local LLM inference on M3 Ultra?

M3 Ultra ships with up to 512 GB of unified memory, which sets the ceiling on what fits: a 7B-class model at 4-bit needs roughly 6 GB, a 30B-class model at 4-bit needs roughly 18 GB, and a 70B-class model at 4-bit needs roughly 38 GB.
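
As a back-of-envelope check, weight-only memory for a group-quantized 4-bit model is roughly parameters × effective bits ÷ 8. The sketch below assumes ~4.5 effective bits per weight (4-bit values plus per-group scales); that constant is an illustrative assumption, not llm-speed's published methodology, and the rounded figures above fold in runtime buffers and rounding, so they differ slightly from the raw weight numbers.

```python
# Weight-only memory for group-quantized 4-bit models (sketch).
# EFFECTIVE_BITS ~= 4.5 (4-bit values plus per-group scales/biases) is an
# illustrative assumption; exact totals depend on the quant group size.
EFFECTIVE_BITS = 4.5

def weights_gb(params_billion: float) -> float:
    # params * effective bits per weight / 8 bits per byte, in GB
    return params_billion * 1e9 * EFFECTIVE_BITS / 8 / 1e9

for n in (7, 30, 70):
    print(f"{n:>2}B @ 4-bit ~ {weights_gb(n):.1f} GB of weights")
# -> 7B ~ 3.9 GB, 30B ~ 16.9 GB, 70B ~ 39.4 GB, weights only
```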

Long-context workloads add KV-cache pressure on top of the weights — a 32k-token context on a 70B-class model adds another 8-16 GB depending on attention layout, so the practical fit is tighter than the weight-only number.
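
To see where that 8-16 GB band comes from: a grouped-query-attention model caches one K and one V tensor per layer, sized kv_heads × head_dim per token. The sketch below plugs in Llama-3.3-70B's usual shapes (80 layers, 8 KV heads, head dimension 128) with an fp16 cache; a quantized or restructured cache moves the result within the band.

```python
# KV-cache size for a grouped-query-attention model (sketch).
# Default shapes assume Llama-3.3-70B (80 layers, 8 KV heads,
# head_dim 128) with an fp16 (2-byte) cache; the formula is generic.
def kv_cache_gb(ctx_tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_el: int = 2) -> float:
    # Factor of 2 covers the separate K and V tensors in each layer.
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_el / 1e9

print(f"32k-token context ~ {kv_cache_gb(32_768):.1f} GB")  # ~10.7 GB
```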

For exact "did it actually run" evidence on M3 Ultra, the leaderboard at /hw/m3-ultra lists every model that has produced a signed result on this hardware.

Models successfully run on M3 Ultra

Is M3 Ultra worth it for local LLM inference in 2026?

On signed data, M3 Ultra delivers up to 192.5 decode tok/s on stable-code-instruct-3b-4bit via mlx (run r_y2_5y8oo97d), which puts it comfortably above interactive-coding thresholds (~80 tok/s) for the published top configuration.
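
To translate those throughputs into wall-clock feel (a simple unit conversion, not an llm-speed metric; the 500-token completion length is an arbitrary example):

```python
# Decode tok/s -> per-token latency and completion wall-clock time,
# using the signed run numbers quoted above; 500 tokens is illustrative.
for label, tps in [("3B top run", 192.5), ("70B run", 16.8),
                   ("~interactive floor", 80.0)]:
    print(f"{label}: {1000 / tps:.1f} ms/token, "
          f"500 tokens in {500 / tps:.1f} s")
# -> 5.2 ms / 2.6 s; 59.5 ms / 29.8 s; 12.5 ms / 6.2 s
```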

"Worth it" depends on your model class: M3 Ultra is most useful for 30B-70B class local models where its memory ceiling matters.

For "what fits and how fast", the per-model rows on /hw/m3-ultra are the honest answer; for cross-rig comparisons, see /vs/m3-ultra-vs-rtx-5090.

M3 Ultra leaderboard · M3 Ultra vs RTX 5090

What quantization should I use on M3 Ultra?

On M3 Ultra, the only quant with a signed local run is 4bit at 139.6 tok/s on Qwen2.5-7B-Instruct-4bit (run r_5r6rhiynenc); a multi-quant comparison on M3 Ultra is not yet published.

Quant choice is a quality-vs-speed tradeoff that this hardware FAQ does not arbitrate; llm-speed publishes hardware-side speed, not output quality. For quality scores, see the model card on Hugging Face and the LMSYS Chatbot Arena.

4bit: 139.6 tok/s (run r_5r6rhiynenc)