Qwen3-32B-4bit
2 workload results across 2 hardware configurations.
Fastest known config: 34.4 decode tok/s on M3 Ultra (60-core GPU) + 96GB unified via mlx (full run: r_anmmc80-aoq).
M3 Pro (18-core GPU) + 36GB unified
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 7.16 | 25.65 | 4,288 | r_pnrrpcdqfo4 |
M3 Ultra (60-core GPU) + 96GB unified
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 34.41 | 95.12 | 1,156 | r_anmmc80-aoq |
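Both signed runs above use the mlx backend. As a rough reproduction sketch (not the benchmark harness itself: the Hugging Face repo id `mlx-community/Qwen3-32B-4bit`, the prompt, and the sampling settings below are assumptions, since the rows record only backend and version), mlx-lm's `generate` with `verbose=True` prints prompt and generation tokens/sec, which correspond to the prefill and decode columns:

```python
# Rough sketch for reproducing a chat-short style run on the mlx backend.
# Assumptions: `pip install mlx-lm`, Apple Silicon with enough unified memory,
# and 4-bit weights published as mlx-community/Qwen3-32B-4bit (repo id is an
# assumption, not taken from the run metadata above).
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-32B-4bit")

# A short chat-style prompt, wrapped in the model's chat template.
messages = [{"role": "user", "content": "Summarize the benefits of unified memory for LLM inference."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True makes mlx-lm print prompt (prefill) and generation (decode)
# tokens/sec after the run, comparable to the table columns above.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```

The `mlx_lm.generate` CLI prints the same tokens/sec summary for one-off runs.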
Community folklore
61 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.
- community confidence 75%
30.00 tok/s — qwen3 32B on RTX 3090 via vllm gptq (the quote states 30 t/s as an expectation for the 3090, not a measured result)
“Thank you for sharing! The difference for PP between 3090 vs M3MAX is remarkable. I thought 3090 would reach 30 t/s for TG, especially with tensor parallelism. Oh actually, can you please share vLLM results? You can just share one data point if you don't”
- community confidence 75%
100.0 tok/s — qwen 3 32b on 2x RTX 3090 via vllm awq (per the quote, the ~100 tok/s figure was with lmdeploy; vllm reached ~80 tok/s)
“I have 2x3090 to run qwen 3 32b awq (4bit). Using vllm it can run ~80tok /s , using lmdeploy it much faster, maybe ~100tok/s. I like fast speed because iam using it as code agent. Fast speed definately help, it allow me to decide use the code or ask better code u”
- community confidence 75%
80.00 tok/s — qwen 3 32b on 2x RTX 3090 via vllm awq (see the launch sketch after this list)
“I have 2x3090 to run qwen 3 32b awq (4bit). Using vllm it can run ~80tok /s , using lmdeploy it much faster, maybe ~100tok/s. I like fast speed because iam using it as code agent. Fast speed definately help, it all”
- community confidence 65%
40.00 tok/s — Qwen3 32b via llama.cpp gptq (the quote ties the 40 t/s figure to gptq 4-bit with 4x tensor parallelism; llama.cpp is mentioned only for the 235B run)
“Note that you will get 40t/s for Qwen3 32b gptq 4bit with 4x tensor parallelism. Qwen3 235B Q4_1 will work with llama.cpp and 5xMI50 at 19t/s initially. But expect that”
- community confidence 60%
180.0 tok/s — Qwen3-32B on M4 via lm-studio (the quote attributes the 180 tok/s figure to a 0.6B micro-model, not the 32B)
“its 79.51% while sustaining ~64 tok/s - arguably today's best speed/quality trade-off for Mac setups. 5️⃣ The 0.6B micro-model races above 180 tok/s but tops out at 37.56% - that's why it's not even on the graph (50 % performance cut-off). All local runs were done with @lmstudi…”
- community confidence 60%
45.00 tok/s — Qwen3-32B on Fireworks AI via hosted-api (the quote credits ~55 tok/s to the Fireworks-hosted model; the ~45 tok/s figure is a locally run 30B-A3B Unsloth quant)
“(via Fireworks API) tops the table at 83.66% with ~55 tok/s. 2️⃣ But the 30B-A3B Unsloth quant delivered 82.20% while running locally at ~45 tok/s and with zero API spend. 3️⃣ The same Unsloth build is ~5x faster than Qwen's Qwen3-32B, which scores 82.20% as well yet crawls at …”
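Several of the folklore entries above describe serving a 4-bit AWQ build of Qwen3-32B across two RTX 3090s with vLLM tensor parallelism. A minimal sketch of that kind of setup follows, assuming the weights are available as `Qwen/Qwen3-32B-AWQ` (the repo id, context length, and sampling settings are assumptions, not taken from the comments):

```python
# Minimal sketch of the dual-RTX 3090 vLLM setup described in the comments above.
# Assumptions: vLLM installed, two 24 GB GPUs visible, and AWQ 4-bit weights
# published as Qwen/Qwen3-32B-AWQ (repo id is an assumption).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B-AWQ",
    quantization="awq",        # 4-bit AWQ, as in the quoted setup
    tensor_parallel_size=2,    # shard the model across both 3090s
    max_model_len=8192,        # conservative context so the KV cache fits 2x24 GB (assumption)
)

outputs = llm.generate(
    ["Write a Python function that validates an email address."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

vLLM logs its own throughput figures during generation; note that the ~100 tok/s number in the quoted comment was reported under lmdeploy rather than vLLM.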