Qwen3-32B-4bit
2 workload results across 2 hardware configurations.
Fastest known config: 34.4 decode tok/s on M3 Ultra (60-core GPU) + 96GB unified via mlx (full run: r_anmmc80-aoq).
M3 Pro (18-core GPU) + 36GB unified
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 7.16 | 25.65 | 4,288 | r_pnrrpcdqfo4 |
M3 Ultra (60-core GPU) + 96GB unified
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 34.41 | 95.12 | 1,156 | r_anmmc80-aoq |
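Both signed runs above use the mlx backend. As a rough reproduction sketch (not the benchmark harness itself: the Hugging Face repo id `mlx-community/Qwen3-32B-4bit`, the prompt, and the sampling settings below are assumptions, since the rows record only backend and version), mlx-lm's `generate` with `verbose=True` prints prompt and generation tokens/sec, which correspond to the prefill and decode columns:

```python
# Rough sketch for reproducing a chat-short style run on the mlx backend.
# Assumptions: `pip install mlx-lm`, Apple Silicon with enough unified memory,
# and 4-bit weights published as mlx-community/Qwen3-32B-4bit (repo id is an
# assumption, not taken from the run metadata above).
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-32B-4bit")

# A short chat-style prompt, wrapped in the model's chat template.
messages = [{"role": "user", "content": "Summarize the benefits of unified memory for LLM inference."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True makes mlx-lm print prompt (prefill) and generation (decode)
# tokens/sec after the run, comparable to the table columns above.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```

The `mlx_lm.generate` CLI prints the same tokens/sec summary for one-off runs.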
Community folklore
61 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.
- community confidence 75%
30.00 tok/s — qwen3 32B on RTX 3090 via vllm gptq (the quote states 30 t/s as an expectation for the 3090, not a measured result)
“Thank you for sharing! The difference for PP between 3090 vs M3MAX is remarkable. I thought 3090 would reach 30 t/s for TG, especially with tensor parallelism. Oh actually, can you please share vLLM results? You can just share one data point if you don't”
- community confidence 75%
100.0 tok/s — qwen 3 32b on 2x RTX 3090 via vllm awq (per the quote, the ~100 tok/s figure was with lmdeploy; vllm reached ~80 tok/s)
“I have 2x3090 to run qwen 3 32b awq (4bit). Using vllm it can run ~80tok /s , using lmdeploy it much faster, maybe ~100tok/s. I like fast speed because iam using it as code agent. Fast speed definately help, it allow me to decide use the code or ask better code u”
- community confidence 75%
80.00 tok/s — qwen 3 32b on 2x RTX 3090 via vllm awq (see the launch sketch after this list)
“I have 2x3090 to run qwen 3 32b awq (4bit). Using vllm it can run ~80tok /s , using lmdeploy it much faster, maybe ~100tok/s. I like fast speed because iam using it as code agent. Fast speed definately help, it all”
- community confidence 65%
40.00 tok/s — Qwen3 32b via llama.cpp gptq (the quote ties the 40 t/s figure to gptq 4-bit with 4x tensor parallelism; llama.cpp is mentioned only for the 235B run)
“Note that you will get 40t/s for Qwen3 32b gptq 4bit with 4x tensor parallelism. Qwen3 235B Q4_1 will work with llama.cpp and 5xMI50 at 19t/s initially. But expect that”
- community confidence 60%
180.0 tok/s — Qwen3-32B on M4 via lm-studio (the quote attributes the 180 tok/s figure to a 0.6B micro-model, not the 32B)
“its 79.51% while sustaining ~64 tok/s - arguably today's best speed/quality trade-off for Mac setups. 5️⃣ The 0.6B micro-model races above 180 tok/s but tops out at 37.56% - that's why it's not even on the graph (50 % performance cut-off). All local runs were done with @lmstudi…”
- community confidence 60%
45.00 tok/s — Qwen3-32B on Fireworks AI via hosted-api (the quote credits ~55 tok/s to the Fireworks-hosted model; the ~45 tok/s figure is a locally run 30B-A3B Unsloth quant)
“(via Fireworks API) tops the table at 83.66% with ~55 tok/s. 2️⃣ But the 30B-A3B Unsloth quant delivered 82.20% while running locally at ~45 tok/s and with zero API spend. 3️⃣ The same Unsloth build is ~5x faster than Qwen's Qwen3-32B, which scores 82.20% as well yet crawls at …”
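Several of the folklore entries above describe serving a 4-bit AWQ build of Qwen3-32B across two RTX 3090s with vLLM tensor parallelism. A minimal sketch of that kind of setup follows, assuming the weights are available as `Qwen/Qwen3-32B-AWQ` (the repo id, context length, and sampling settings are assumptions, not taken from the comments):

```python
# Minimal sketch of the dual-RTX 3090 vLLM setup described in the comments above.
# Assumptions: vLLM installed, two 24 GB GPUs visible, and AWQ 4-bit weights
# published as Qwen/Qwen3-32B-AWQ (repo id is an assumption).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B-AWQ",
    quantization="awq",        # 4-bit AWQ, as in the quoted setup
    tensor_parallel_size=2,    # shard the model across both 3090s
    max_model_len=8192,        # conservative context so the KV cache fits 2x24 GB (assumption)
)

outputs = llm.generate(
    ["Write a Python function that validates an email address."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

vLLM logs its own throughput figures during generation; note that the ~100 tok/s number in the quoted comment was reported under lmdeploy rather than vLLM.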