
M3 Pro — frequently asked questions

Direct answers to the questions local-LLM enthusiasts ask about M3 Pro, drawn from 48 signed runs on llm-speed. Every numerical claim links to a verifiable run permalink at /r/<id>.

What's the fastest LLM on M3 Pro?

The fastest measured LLM on M3 Pro on llm-speed is Qwen2.5-0.5B-Instruct-4bit at 286.5 decode tok/s on mlx 0.31.3 (workload chat-short, run r_akcbpx5vcqa). Cite as https://llm-speed.com/r/r_akcbpx5vcqa.

This is the headline decode tokens-per-second across every (model, backend) pairing submitted on M3 Pro; faster configurations may exist that have not yet been benchmarked, but among signed runs this is the published top result.

Run r_akcbpx5vcqa · M3 Pro leaderboard
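
Decode tok/s is conventionally the number of generated tokens per second of wall-clock decode time, with prompt processing (prefill) timed separately. A minimal sketch of that arithmetic, where generate_stream is a hypothetical stand-in for whatever streaming API a backend exposes; this is illustrative, not the llm-speed harness:

    import time

    def decode_tok_per_s(generate_stream, prompt, max_tokens=256):
        # Count tokens as they stream out; start the clock at the first
        # token so prompt processing (time-to-first-token) is excluded
        # from the decode measurement.
        n_tokens = 0
        t_first = None
        for _token in generate_stream(prompt, max_tokens=max_tokens):
            if t_first is None:
                t_first = time.perf_counter()
            n_tokens += 1
        t_last = time.perf_counter()
        if n_tokens < 2:
            raise ValueError("need at least two tokens to time decode")
        # n_tokens - 1 inter-token intervals elapse between first and last
        return (n_tokens - 1) / (t_last - t_first)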

Can I run a 7B-class model on M3 Pro?

Yes — llm-speed has measured Qwen2.5-7B-Instruct-4bit at 30.5 decode tok/s on mlx 0.31.3 (workload chat-short, run r_llzv_g-ymaf) on M3 Pro, confirming the 7-9B parameter range fits and runs end-to-end on this hardware. Cite as https://llm-speed.com/r/r_llzv_g-ymaf.

7B-class models on M3 Pro typically use 4-bit quantization; check the run page for the exact backend version and model digest.

Run r_llzv_g-ymaf · Every model on M3 Pro

Can I run a 30B-class model on M3 Pro?

Yes — llm-speed has measured Qwen3-32B-4bit at 7.2 decode tok/s on mlx 0.31.3 (workload chat-short, run r_pnrrpcdqfo4) on M3 Pro, confirming that the 27-33B parameter range (including MoE 30B-A3B variants) fits and runs end-to-end on this hardware. Cite as https://llm-speed.com/r/r_pnrrpcdqfo4.

30B-class models on M3 Pro typically run at whatever quant scheme the run page publishes; check the run for the exact backend version and model digest.

Run r_pnrrpcdqfo4 · Every model on M3 Pro

Can I run a 70B-class model on M3 Pro?

Not locally: the only signed 70B-class result listed under M3 Pro is llama-3.3-70b-instruct at 43.2 decode tok/s via hosted-api (workload chat-short, run r_tthgrsb7zn5), which measures a hosted endpoint rather than on-device inference, so it does not confirm that the 70-72B parameter range fits on this hardware. Cite as https://llm-speed.com/r/r_tthgrsb7zn5.

At roughly 38 GB for 4-bit weights alone (see the memory question below), a 70B-class model exceeds the 36 GB unified-memory ceiling of M3 Pro; check the run page for the exact backend version and model digest.

Run r_tthgrsb7zn5 · Every model on M3 Pro

How does M3 Pro compare to M3 Max for local LLM inference?

For a head-to-head between M3 Pro and M3 Max, see the side-by-side comparison page at https://llm-speed.com/vs/m3-pro-vs-m3-max, which lays out every (model, backend) pair where both rigs have a signed run.

As a single-rig anchor, M3 Pro tops out at 286.5 decode tok/s on Qwen2.5-0.5B-Instruct-4bit via mlx (run r_akcbpx5vcqa); the M3 Max top number is on its own /hw/<slug> page so the comparison stays grounded in measured numbers, not extrapolation.

M3 Pro vs M3 Max (side-by-side) · M3 Max leaderboard · M3 Pro leaderboard

Which backend is fastest on M3 Pro?

The only backend with a signed local run on M3 Pro so far is mlx, with a top result of 286.5 decode tok/s on Qwen2.5-0.5B-Instruct-4bit (run r_akcbpx5vcqa); a multi-backend comparison on M3 Pro is not yet published.

Submit a competing run with "llm-speed bench --backends <other-backend>" on an M3 Pro machine to populate the comparison.

Run r_akcbpx5vcqa · M3 Pro leaderboard

How much unified memory do I need for local LLM inference on M3 Pro?

M3 Pro ships with up to 36 GB of unified memory, which sets the ceiling on what fits: a 7B-class model at 4-bit needs roughly 6 GB, a 30B-class model at 4-bit needs roughly 18 GB, and a 70B-class model at 4-bit needs roughly 38 GB, which is more than the machine has.

Long-context workloads add KV-cache pressure on top of the weights — a 32k-token context on a 70B-class model adds another 8-16 GB depending on attention layout, so the practical fit is tighter than the weight-only number.
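
A back-of-the-envelope fit check reproduces those figures. The sketch below assumes Llama-3-70B-style dimensions (80 layers, 8 grouped-query KV heads, head dim 128) and an fp16 KV cache; these dimensions are assumptions about the model, not values from a run page:

    def weights_gb(n_params_billion, bits_per_weight):
        # Raw weight bytes; real quant schemes add a few percent of
        # scale/zero-point metadata, and the runtime needs headroom too.
        return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

    def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
        # Two cached tensors (K and V) per layer, each of shape
        # [n_kv_heads, ctx_len, head_dim], at fp16 by default.
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

    print(weights_gb(70, 4))                # ~35 GB of raw 4-bit weights
    print(kv_cache_gb(80, 8, 128, 32_768))  # ~10.7 GB of fp16 KV at 32k context

Both numbers land inside the ranges quoted above once quantization metadata and runtime overhead are added.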

For exact "did it actually run" evidence on M3 Pro, the leaderboard at /hw/m3-pro lists every model that has produced a signed result on this hardware.

Models successfully run on M3 Pro

Is M3 Pro worth it for local LLM inference in 2026?

On signed data, M3 Pro delivers up to 286.5 decode tok/s on Qwen2.5-0.5B-Instruct-4bit via mlx (run r_akcbpx5vcqa), which puts it comfortably above interactive-coding thresholds (~80 tok/s) for the published top configuration.
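
To translate a decode rate into perceived latency, divide completion length by tok/s. A quick conversion using the signed numbers above; this ignores time-to-first-token (prompt processing), so real wall-clock time is slightly longer:

    def completion_seconds(n_tokens, decode_tok_s):
        # Wall-clock streaming time for the decode phase only.
        return n_tokens / decode_tok_s

    print(completion_seconds(400, 286.5))  # ~1.4 s at the M3 Pro top run
    print(completion_seconds(400, 30.5))   # ~13 s on the 7B-class run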

"Worth it" depends on your model class: M3 Pro is most useful for 7B-30B class local models with comfortable headroom.

For "what fits and how fast", the per-model rows on /hw/m3-pro are the honest answer; for cross-rig comparisons, see /vs/m3-pro-vs-m3-max.

M3 Pro leaderboard · M3 Pro vs M3 Max

What quantization should I use on M3 Pro?

On M3 Pro, the only quant with a signed local run is 4bit at 286.5 tok/s on Qwen2.5-0.5B-Instruct-4bit (run r_akcbpx5vcqa); a multi-quant comparison on M3 Pro is not yet published.

Quant choice is a quality-vs-speed tradeoff that this hardware FAQ does not arbitrate; llm-speed publishes hardware-side speed, not output quality. For quality scores, see the model card on Hugging Face and the LMSYS Chatbot Arena.

4bit: 286.5 tok/s (run r_akcbpx5vcqa)
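
For context on why quant choice matters for fit as well as speed, bits-per-weight maps directly onto memory footprint. A minimal sketch for a 7B-class model, counting raw weight bytes only (real schemes such as group-wise 4-bit add a few percent of scale metadata):

    # Raw weight footprint per quantization level for a 7B-class model.
    for bits in (16, 8, 4):
        gb = 7e9 * bits / 8 / 1e9
        print(f"{bits}-bit: {gb:.1f} GB of weights")
    # -> 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB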