M3 Pro (18-core GPU) LLM benchmark

The fastest LLM measured on the M3 Pro (18-core GPU) is Qwen2.5-0.5B-Instruct-4bit, at 286.5 decode tok/s via mlx (signed run). This page covers 11 reproducible runs across 9 models and lists decode tok/s, prefill tok/s, and time to first token (TTFT) for each run; every number links to the run it came from.

Fastest known config on M3 Pro (18-core GPU)

286.5 decode tok/s

Qwen2.5-0.5B-Instruct-4bit via mlx (4bit). See the full run: r_akcbpx5vcqa

Qwen3-32B-4bit

Workload	Backend	Quant	decode tok/s	prefill tok/s	TTFT	Run
chat-short	mlx@0.31.3	-	7.16 tok/s	25.65 tok/s	4,288 ms	r_pnrrpcdqfo4

Qwen2.5-32B-Instruct-4bit

Workload	Backend	Quant	decode tok/s	prefill tok/s	TTFT	Run
chat-short	mlx@0.31.3	-	7.05 tok/s	33.41 tok/s	3,921 ms	r_f4x3xaan2ay

gemma-2-9b-it-4bit

Workload	Backend	Quant	decode tok/s	prefill tok/s	TTFT	Run
chat-short	mlx@0.31.3	-	21.97 tok/s	102.5 tok/s	1,151 ms	r_q7t7b7dcuz5

Qwen2.5-14B-Instruct-4bit

Workload	Backend	Quant	decode tok/s	prefill tok/s	TTFT	Run
chat-short	mlx@0.31.3	4bit	14.41 tok/s	117.9 tok/s	1,111 ms	r_ia73dzeue0b

phi-4-4bit

Workload	Backend	Quant	decode tok/s	prefill tok/s	TTFT	Run
chat-short	mlx@0.31.3	-	15.61 tok/s	120.7 tok/s	936 ms	r_-w8hnn61va_

Llama-3.1-8B-Instruct-4bit

Workload	Backend	Quant	decode tok/s	prefill tok/s	TTFT	Run
chat-short	mlx@0.31.3	-	29.20 tok/s	203.3 tok/s	669 ms	r_h0-use1ypnb

Qwen2.5-7B-Instruct-4bit

Workload	Backend	Quant	decode tok/s	prefill tok/s	TTFT	Run
chat-short	mlx@0.31.3	4bit	30.52 tok/s	161.9 tok/s	809 ms	r_llzv_g-ymaf
chat-short	mlx@0.31.3	4bit	15.67 tok/s	54.34 tok/s	2,411 ms	r_e3t93rscswq

stable-code-instruct-3b-4bit

Workload	Backend	Quant	decode tok/s	prefill tok/s	TTFT	Run
chat-short	mlx@0.31.3	-	19.37 tok/s	131.3 tok/s	967 ms	r_pqjsvd-cub4

Qwen2.5-0.5B-Instruct-4bit

Workload	Backend	Quant	decode tok/s	prefill tok/s	TTFT	Run
chat-short	mlx@0.31.3	4bit	286.5 tok/s	656.2 tok/s	200 ms	r_akcbpx5vcqa
chat-short	mlx@0.31.3	4bit	282.6 tok/s	668.8 tok/s	196 ms	r_bftqtkilvoe
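The headline figure can be cross-checked against the rows above. The following is a minimal sketch, not the site's own tooling: it assumes the rows are copied into a tab-separated string with the model name prepended as the first field (on this page the model name is a heading, not a column), and the helper names `parse_number`, `parse_rows`, and `fastest` are hypothetical.

```python
import re

# Rows copied from the tables on this page. Prepending the model name as the
# first field is an assumption of this sketch; the page keeps it in a heading.
ROWS = """\
Qwen3-32B-4bit\tchat-short\tmlx@0.31.3\t-\t7.16tok/s\t25.65tok/s\t4,288ms\tr_pnrrpcdqfo4
Qwen2.5-32B-Instruct-4bit\tchat-short\tmlx@0.31.3\t-\t7.05tok/s\t33.41tok/s\t3,921ms\tr_f4x3xaan2ay
gemma-2-9b-it-4bit\tchat-short\tmlx@0.31.3\t-\t21.97tok/s\t102.5tok/s\t1,151ms\tr_q7t7b7dcuz5
Qwen2.5-14B-Instruct-4bit\tchat-short\tmlx@0.31.3\t4bit\t14.41tok/s\t117.9tok/s\t1,111ms\tr_ia73dzeue0b
phi-4-4bit\tchat-short\tmlx@0.31.3\t-\t15.61tok/s\t120.7tok/s\t936ms\tr_-w8hnn61va_
Llama-3.1-8B-Instruct-4bit\tchat-short\tmlx@0.31.3\t-\t29.20tok/s\t203.3tok/s\t669ms\tr_h0-use1ypnb
Qwen2.5-7B-Instruct-4bit\tchat-short\tmlx@0.31.3\t4bit\t30.52tok/s\t161.9tok/s\t809ms\tr_llzv_g-ymaf
Qwen2.5-7B-Instruct-4bit\tchat-short\tmlx@0.31.3\t4bit\t15.67tok/s\t54.34tok/s\t2,411ms\tr_e3t93rscswq
stable-code-instruct-3b-4bit\tchat-short\tmlx@0.31.3\t-\t19.37tok/s\t131.3tok/s\t967ms\tr_pqjsvd-cub4
Qwen2.5-0.5B-Instruct-4bit\tchat-short\tmlx@0.31.3\t4bit\t286.5tok/s\t656.2tok/s\t200ms\tr_akcbpx5vcqa
Qwen2.5-0.5B-Instruct-4bit\tchat-short\tmlx@0.31.3\t4bit\t282.6tok/s\t668.8tok/s\t196ms\tr_bftqtkilvoe
"""

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_number(cell: str) -> float:
    """Extract the leading number from a cell such as '4,288ms' or '7.16 tok/s'."""
    match = _NUMBER.search(cell.replace(",", ""))
    if match is None:
        raise ValueError(f"no number in cell: {cell!r}")
    return float(match.group())


def parse_rows(text: str) -> list[dict]:
    """Turn tab-separated lines into dictionaries with numeric metric fields."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        model, workload, backend, quant, decode, prefill, ttft, run = line.split("\t")
        rows.append({
            "model": model,
            "workload": workload,
            "backend": backend,
            "quant": quant,
            "decode_tok_s": parse_number(decode),
            "prefill_tok_s": parse_number(prefill),
            "ttft_ms": parse_number(ttft),
            "run": run,
        })
    return rows


def fastest(rows: list[dict]) -> dict:
    """Return the run with the highest decode throughput."""
    return max(rows, key=lambda r: r["decode_tok_s"])


if __name__ == "__main__":
    rows = parse_rows(ROWS)
    best = fastest(rows)
    print(len(rows), "runs on", len({r["model"] for r in rows}), "models")
    print(best["model"], best["decode_tok_s"], best["run"])
```

Ranking by a different column, such as `ttft_ms` with `min` instead of `max`, only requires changing the key function.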
