Cheatsheet

Local LLM speed cheatsheet: decode tok/s by GPU & Mac

Name: LLM speed cheatsheet — every model x every rig
Creator: llm-speed
License: https://creativecommons.org/licenses/by/4.0/

Best decode tokens-per-second per (model × hardware) tuple measured by llm-speed under suite-v1. Numbers are wall-clock, batch size 1 unless the workload says otherwise. Each row links to the canonical run page; cite as llm-speed.com/r/<id>.

33 (model × hardware) combinations · sorted by decode tok/s · suite-v1

Model	Quant	Hardware	Backend	Workload	decode tok/s	prefill tok/s	TTFT (ms)	Run
mlx-community/Qwen3-0.6B-4bit	-	M3 Ultra (60-core GPU) + 96GB unified	mlx@0.31.3	chat-short	429.2	1185.2	92.8	r_6rgao6ao735
stable-code-instruct-3b	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	concurrent-decode	356.1	—	—	r_q9f15lz6831
gpt-oss-20b	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	chat-short	318.4	—	172.5	r_b9ul-vxh9sc
DeepSeek-Coder-V2-Lite-Instruct	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	chat-short	309.5	—	267.6	r_0_gs1rgl2fl
Qwen3-Coder-30B-A3B-Instruct	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	chat-short	259.9	—	218.4	r_c7qyvvmmsv1
Qwen2.5-Coder-7B-Instruct	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	chat-short	255.3	—	35.4	r_2b9o6y_49mi
Qwen2.5-7B-Instruct	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	chat-short	246.7	—	33.9	r_3yn-4321hp-
Llama-3.1-8B-Instruct	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	chat-short	232.2	—	33.8	r_kfrkg-vn376
Yi-Coder-9B-Chat	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	chat-short	199.3	—	36.2	r_u4iojm6-ekg
gemma3	Q4_K_M	RTX 4090 (24GB) + AMD EPYC 7443 24-Core Processor (24c) + 252GB	ollama@0.31.1	chat-short	195.0	167.2	705.6	r_dlanfbgym0h
deepseek-coder-v2	Q4_0	RTX 3090 (24GB) + AMD EPYC 7702P 64-Core Processor (64c) + 252GB	ollama@0.31.1	chat-short	189.5	159.2	734.9	r_o2-1w665rtq
mlx-community/Qwen3-4B-Instruct-2507-4bit	-	M3 Ultra (60-core GPU) + 96GB unified	mlx@0.31.3	chat-short	182.7	636.6	172.8	r_p2rpb_0iyjj
qwen3-coder	Q4_K_M	RTX 4090 (24GB) + AMD EPYC 7352 24-Core Processor (24c) + 252GB	ollama@0.31.1	concurrent-decode	179.9	—	—	r_wjq32z47vlp
qwen2.5-coder	Q4_K_M	RTX 4090 (48GB) + AMD EPYC 7763 64-Core Processor (128c) + 1008GB	ollama@0.31.1	concurrent-decode	161.1	—	—	r_mv8n8k9wu1e
llama3.1	Q4_K_M	RTX 4090 (48GB) + AMD EPYC 7763 64-Core Processor (128c) + 1008GB	ollama@0.31.1	chat-short	154.4	328.0	335.4	r_h1ub_1uxzdh
gpt-oss	MXFP4	RTX 4090 (24GB) + AMD EPYC 7352 24-Core Processor (24c) + 252GB	ollama@0.31.1	chat-long	141.7	631.5	5048.2	r_iu2sfa9ykvw
phi-4	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	concurrent-decode	140.2	—	—	r_k1u_k1j_1i2
Qwen2.5-Coder-14B-Instruct	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	chat-short	140.0	—	42.1	r_opuj21f13-_
qwen2.5-coder	Q4_K_M	RTX 3090 (24GB) + AMD EPYC 7663 56-Core Processor (56c) + 252GB	ollama@0.31.1	concurrent-decode	139.2	—	—	r_pb0jpnbujji
llama3.1	Q4_K_M	RTX 3090 (24GB) + AMD EPYC 7663 56-Core Processor (56c) + 252GB	ollama@0.31.1	chat-short	136.2	2.2	49960.0	r_9hurqggbshk
deepseek-r1	Q4_K_M	RTX 4090 (24GB) + AMD EPYC 75F3 32-Core Processor (64c) + 504GB	ollama@0.31.1	agent-trace	133.8	8852.4	366.2	r_fg77v2hhohb
Qwen2.5-14B-Instruct	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	chat-short	133.3	—	55.9	r_tj9bu7gvnvh
glm-4.7-flash	Q4_K_M	RTX 4090 (24GB) + AMD EPYC 75F3 32-Core Processor (64c) + 504GB	ollama@0.31.1	agent-trace	129.9	2491.1	1262.5	r_o636l3cc-rr
mlx-community/Qwen3-8B-4bit	-	M3 Ultra (60-core GPU) + 96GB unified	mlx@0.31.3	chat-short	120.5	446.1	246.6	r_b1awwhkoo2v
lmstudio-community/DeepSeek-R1-0528-Qwen3-8B-MLX-4bit	-	M3 Ultra (60-core GPU) + 96GB unified	mlx@0.31.3	chat-short	116.8	424.5	247.4	r_yy_6jfx70jq
mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit	-	M3 Ultra (60-core GPU) + 96GB unified	mlx@0.31.3	chat-short	112.5	308.7	356.3	r_o6wbmfegmde
Codestral-22B-v0.1	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	concurrent-decode	100.3	—	—	r_4q040m4scic
Qwen2.5-32B-Instruct	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	chat-short	71.9	—	115.2	r_twfs86tf_xf
Qwen2.5-Coder-32B-Instruct	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	concurrent-decode	71.0	—	—	r_v983y0y3r2u
Qwen3-32B	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	chat-short	69.5	—	327.1	r_phvxm9dcak0
gemma-2-9b-it	-	RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB	llama.cpp	chat-short	69.5	—	324.6	r_1_xl4zb5-xj
qwen2.5-coder	Q4_K_M	RTX 3090 (24GB) + AMD EPYC 7702P 64-Core Processor (64c) + 252GB	ollama@0.31.1	chat-short	69.2	402.0	325.8	r_6ahy-dq0f_0
qwen3.6	Q4_K_M	RTX 4090 (48GB) + AMD EPYC 7763 64-Core Processor (128c) + 1008GB	ollama@0.31.1	agent-trace	44.2	1910.5	1807.9	r_h_659oy695r

How to read this table

Each row is the highest decode tokens-per-second measured for a unique (model, hardware) pair. When more than one workload was run, we pick the workload that produced the headline number. Lower-tps duplicates from the same machine are not shown; click through to the run page for the full per-workload breakdown.

decode tok/s is wall-clock streaming throughput (memory-bandwidth-bound). prefill tok/s is prompt ingestion throughput (compute-bound). TTFT is time-to-first-token in milliseconds. See /glossary for definitions and /methodology for the workload spec.