
M3 Pro — frequently asked questions

Direct answers to the questions local-LLM enthusiasts ask about M3 Pro, drawn from 48 signed runs on llm-speed. Every numerical claim links to a verifiable run permalink at /r/<id>.

What's the fastest LLM on M3 Pro?

The fastest measured LLM on M3 Pro on llm-speed is Qwen2.5-0.5B-Instruct-4bit at 286.5 decode tok/s on mlx 0.31.3 (workload chat-short, run r_akcbpx5vcqa). Cite as https://llm-speed.com/r/r_akcbpx5vcqa.

This is the headline decode tokens-per-second across every (model, backend) pairing submitted on M3 Pro; faster configurations may exist that have not yet been benchmarked, but among signed runs this is the published top result.

Run r_akcbpx5vcqa · M3 Pro leaderboard
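
Decode tok/s is conventionally the number of generated tokens per second of wall-clock decode time, with prompt processing (prefill) timed separately. A minimal sketch of that arithmetic, where generate_stream is a hypothetical stand-in for whatever streaming API a backend exposes; this is illustrative, not the llm-speed harness:

    import time

    def decode_tok_per_s(generate_stream, prompt, max_tokens=256):
        # Count tokens as they stream out; start the clock at the first
        # token so prompt processing (time-to-first-token) is excluded
        # from the decode measurement.
        n_tokens = 0
        t_first = None
        for _token in generate_stream(prompt, max_tokens=max_tokens):
            if t_first is None:
                t_first = time.perf_counter()
            n_tokens += 1
        t_last = time.perf_counter()
        if n_tokens < 2:
            raise ValueError("need at least two tokens to time decode")
        # n_tokens - 1 inter-token intervals elapse between first and last
        return (n_tokens - 1) / (t_last - t_first)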

Can I run a 7B-class model on M3 Pro?

Yes — llm-speed has measured Qwen2.5-7B-Instruct-4bit at 30.5 decode tok/s on mlx 0.31.3 (workload chat-short, run r_llzv_g-ymaf) on M3 Pro, confirming the 7-9B parameter range fits and runs end-to-end on this hardware. Cite as https://llm-speed.com/r/r_llzv_g-ymaf.

7B-class models on M3 Pro typically use 4-bit quantization; check the run page for the exact backend version and model digest.

Run r_llzv_g-ymaf · Every model on M3 Pro

Can I run a 30B-class model on M3 Pro?

Yes — llm-speed has measured Qwen3-32B-4bit at 7.2 decode tok/s on mlx 0.31.3 (workload chat-short, run r_pnrrpcdqfo4) on M3 Pro, confirming that the 27-33B parameter range (including MoE 30B-A3B variants) fits and runs end-to-end on this hardware. Cite as https://llm-speed.com/r/r_pnrrpcdqfo4.

30B-class models on M3 Pro typically run at whatever quant scheme the run page publishes; check the run for the exact backend version and model digest.

Run r_pnrrpcdqfo4 · Every model on M3 Pro

Can I run a 70B-class model on M3 Pro?

Not locally: the only signed 70B-class result listed under M3 Pro is llama-3.3-70b-instruct at 43.2 decode tok/s via hosted-api (workload chat-short, run r_tthgrsb7zn5), which measures a hosted endpoint rather than on-device inference, so it does not confirm that the 70-72B parameter range fits on this hardware. Cite as https://llm-speed.com/r/r_tthgrsb7zn5.

At roughly 38 GB for 4-bit weights alone (see the memory question below), a 70B-class model exceeds the 36 GB unified-memory ceiling of M3 Pro; check the run page for the exact backend version and model digest.

Run r_tthgrsb7zn5 · Every model on M3 Pro

How does M3 Pro compare to M3 Max for local LLM inference?

For a head-to-head between M3 Pro and M3 Max, see the side-by-side comparison page at https://llm-speed.com/vs/m3-pro-vs-m3-max, which lays out every (model, backend) pair where both rigs have a signed run.

As a single-rig anchor, M3 Pro tops out at 286.5 decode tok/s on Qwen2.5-0.5B-Instruct-4bit via mlx (run r_akcbpx5vcqa); the M3 Max top number is on its own /hw/<slug> page so the comparison stays grounded in measured numbers, not extrapolation.

M3 Pro vs M3 Max (side-by-side) · M3 Max leaderboard · M3 Pro leaderboard

Which backend is fastest on M3 Pro?

The only backend with a signed local run on M3 Pro so far is mlx, with a top result of 286.5 decode tok/s on Qwen2.5-0.5B-Instruct-4bit (run r_akcbpx5vcqa); a multi-backend comparison on M3 Pro is not yet published.

Submit a competing run with "llm-speed bench --backends <other-backend>" on an M3 Pro machine to populate the comparison.

Run r_akcbpx5vcqa · M3 Pro leaderboard

How much unified memory do I need for local LLM inference on M3 Pro?

M3 Pro ships with up to 36 GB of unified memory, which sets the ceiling on what fits: a 7B-class model at 4-bit needs roughly 6 GB, a 30B-class model at 4-bit needs roughly 18 GB, and a 70B-class model at 4-bit needs roughly 38 GB, which is more than the machine has.

Long-context workloads add KV-cache pressure on top of the weights — a 32k-token context on a 70B-class model adds another 8-16 GB depending on attention layout, so the practical fit is tighter than the weight-only number.
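
A back-of-the-envelope fit check reproduces those figures. The sketch below assumes Llama-3-70B-style dimensions (80 layers, 8 grouped-query KV heads, head dim 128) and an fp16 KV cache; these dimensions are assumptions about the model, not values from a run page:

    def weights_gb(n_params_billion, bits_per_weight):
        # Raw weight bytes; real quant schemes add a few percent of
        # scale/zero-point metadata, and the runtime needs headroom too.
        return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

    def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
        # Two cached tensors (K and V) per layer, each of shape
        # [n_kv_heads, ctx_len, head_dim], at fp16 by default.
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

    print(weights_gb(70, 4))                # ~35 GB of raw 4-bit weights
    print(kv_cache_gb(80, 8, 128, 32_768))  # ~10.7 GB of fp16 KV at 32k context

Both numbers land inside the ranges quoted above once quantization metadata and runtime overhead are added.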

For exact "did it actually run" evidence on M3 Pro, the leaderboard at /hw/m3-pro lists every model that has produced a signed result on this hardware.

Models successfully run on M3 Pro

Is M3 Pro worth it for local LLM inference in 2026?

On signed data, M3 Pro delivers up to 286.5 decode tok/s on Qwen2.5-0.5B-Instruct-4bit via mlx (run r_akcbpx5vcqa), which puts it comfortably above interactive-coding thresholds (~80 tok/s) for the published top configuration.
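
To translate a decode rate into perceived latency, divide completion length by tok/s. A quick conversion using the signed numbers above; this ignores time-to-first-token (prompt processing), so real wall-clock time is slightly longer:

    def completion_seconds(n_tokens, decode_tok_s):
        # Wall-clock streaming time for the decode phase only.
        return n_tokens / decode_tok_s

    print(completion_seconds(400, 286.5))  # ~1.4 s at the M3 Pro top run
    print(completion_seconds(400, 30.5))   # ~13 s on the 7B-class run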

"Worth it" depends on your model class: M3 Pro is most useful for 7B-30B class local models with comfortable headroom.

For "what fits and how fast", the per-model rows on /hw/m3-pro are the honest answer; for cross-rig comparisons, see /vs/m3-pro-vs-m3-max.

M3 Pro leaderboard · M3 Pro vs M3 Max

What quantization should I use on M3 Pro?

On M3 Pro, the only quant with a signed local run is 4bit at 286.5 tok/s on Qwen2.5-0.5B-Instruct-4bit (run r_akcbpx5vcqa); a multi-quant comparison on M3 Pro is not yet published.

Quant choice is a quality-vs-speed tradeoff that this hardware FAQ does not arbitrate; llm-speed publishes hardware-side speed, not output quality. For quality scores, see the model card on Hugging Face and the LMSYS Chatbot Arena.

4bit: 286.5 tok/s (run r_akcbpx5vcqa)
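
For context on why quant choice matters for fit as well as speed, bits-per-weight maps directly onto memory footprint. A minimal sketch for a 7B-class model, counting raw weight bytes only (real schemes such as group-wise 4-bit add a few percent of scale metadata):

    # Raw weight footprint per quantization level for a 7B-class model.
    for bits in (16, 8, 4):
        gb = 7e9 * bits / 8 / 1e9
        print(f"{bits}-bit: {gb:.1f} GB of weights")
    # -> 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB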