
Qwen3-32B-4bit

2 workload results across 2 hardware configurations.

Fastest known config

34.4 decode tok/s

on M3 Ultra (60-core GPU) + 96GB unified via mlx (full run: r_anmmc80-aoq)

M3 Pro (18-core GPU) + 36GB unified

| Workload | Backend | Quant | decode tok/s | prefill tok/s | TTFT | Run |
| --- | --- | --- | --- | --- | --- | --- |
| chat-short | mlx@0.31.3 | 4bit | 7.16 | 25.65 | 4,288 ms | r_pnrrpcdqfo4 |

M3 Ultra (60-core GPU) + 96GB unified

| Workload | Backend | Quant | decode tok/s | prefill tok/s | TTFT | Run |
| --- | --- | --- | --- | --- | --- | --- |
| chat-short | mlx@0.31.3 | 4bit | 34.41 | 95.12 | 1,156 ms | r_anmmc80-aoq |
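
To reproduce rows like these, a minimal sketch using the mlx-lm Python API is shown below. The model id mlx-community/Qwen3-32B-4bit and the prompt are assumptions (the runs above publish neither); mlx-lm's stream_generate yields roughly one chunk per generated token, which is what the decode rate here counts.

```python
# Minimal sketch of measuring decode tok/s and TTFT with mlx-lm.
# Assumptions: `pip install mlx-lm`, and that the 4-bit build is
# published as "mlx-community/Qwen3-32B-4bit" (not stated on this page).
import time
from mlx_lm import load, stream_generate

MODEL_ID = "mlx-community/Qwen3-32B-4bit"  # assumed model id
PROMPT = "Summarize the trade-offs of 4-bit quantization."  # stand-in prompt

model, tokenizer = load(MODEL_ID)

start = time.perf_counter()
ttft = None
n_tokens = 0
for _chunk in stream_generate(model, tokenizer, PROMPT, max_tokens=256):
    if ttft is None:
        ttft = time.perf_counter() - start  # prefill ends at the first token
    n_tokens += 1  # roughly one chunk per generated token
total = time.perf_counter() - start

print(f"TTFT:   {ttft * 1000:,.0f} ms")
# decode rate excludes the prefill phase and the first token
print(f"decode: {(n_tokens - 1) / (total - ttft):.2f} tok/s")
```

As a sanity check, TTFT should be roughly prompt_tokens / prefill tok/s, and both rows above agree on a chat-short prompt of about 110 tokens: 1.156 s × 95.12 tok/s ≈ 110, and 4.288 s × 25.65 tok/s ≈ 110.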

Community folklore

61 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.

  • community · confidence 75%

    30.00tok/s qwen3 32B on RTX 3090 via vllm gptq

    Thank you for sharing! The difference for PP between 3090 vs M3MAX is remarkable. I thought 3090 would reach 30 t/s for TG, especially with tensor parallelism.  Oh actually, can you please share vLLM results? You can just share one data point if you don't

    source: Reddit · u/MLDataScientist · 2025-05-11


  • community · confidence 75%

    100.0tok/s qwen 3 32b on 2x RTX 3090 via lmdeploy awq

    I have 2x3090 to run qwen 3 32b awq (4bit). Using vllm it can run ~80tok /s , using lmdeploy it much faster, maybe ~100tok/s. I like fast speed because iam using it as code agent. Fast speed definately help, it allow me to decide use the code or ask better code u

    source: Reddit · u/Kasatka06 · 2025-05-03

  • community · confidence 75%

    80.00tok/s qwen 3 32b on 2x RTX 3090 via vllm awq (see the vLLM sketch after this list)

    I have 2x3090 to run qwen 3 32b awq (4bit). Using vllm it can run ~80tok /s , using lmdeploy it much faster, maybe ~100tok/s. I like fast speed because iam using it as code agent. Fast speed definately help, it all

    source: Reddit · u/Kasatka06 · 2025-05-03

  • community · confidence 65%

    40.00tok/s Qwen3 32b gptq 4bit with 4x tensor parallelism

    Note that you will get 40t/s for Qwen3 32b gptq 4bit with 4x tensor parallelism. Qwen3 235B Q4_1 will work with llama.cpp and 5xMI50 at 19t/s initially. But expect that

    source: Reddit · u/MLDataScientist · 2025-07-22

  • community · confidence 60%

    180.0tok/s Qwen3-0.6B on M4 via lm-studio

    its 79.51% while sustaining ~64 tok/s - arguably today's best speed/quality trade-off for Mac setups. 5️⃣ The 0.6B micro-model races above 180 tok/s but tops out at 37.56% - that's why it's not even on the graph (50 % performance cut-off). All local runs were done with @lmstudi…

    source: Reddit · u/ResearchCrafty1804 · 2025-05-07

  • community · confidence 60%

    55.00tok/s Qwen3-32B on Fireworks AI via hosted-api

    (via Fireworks API) tops the table at 83.66% with ~55 tok/s. 2️⃣ But the 30B-A3B Unsloth quant delivered 82.20% while running locally at ~45 tok/s and with zero API spend. 3️⃣ The same Unsloth build is ~5x faster than Qwen's Qwen3-32B, which scores 82.20% as well yet crawls at …

    source: Reddit · u/ResearchCrafty1804 · 2025-05-07

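The 2x RTX 3090 AWQ claims above describe a multi-GPU vLLM setup: a 4-bit checkpoint sharded across cards with tensor parallelism (the 40 t/s GPTQ figure reports the same technique). A minimal sketch is below; the checkpoint id Qwen/Qwen3-32B-AWQ is an assumption, and it illustrates the configuration, not the quoted speeds.

```python
# Minimal sketch of the multi-GPU vLLM setup described in the claims
# above. "Qwen/Qwen3-32B-AWQ" is an assumed checkpoint id; any 4-bit
# AWQ export of Qwen3-32B should load the same way.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B-AWQ",  # assumed AWQ checkpoint id
    quantization="awq",          # 4-bit AWQ weights
    tensor_parallel_size=2,      # shard the model across two 3090s
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Write a Python function that reverses a linked list."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```

Tensor parallelism splits each token's matmuls across GPUs, which is why the 2x3090 figures above beat the ~30 tok/s the first claim expected from a single 3090.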

See all 61 claims for Qwen3-32B-4bit

Qwen3-32B-4bit on hardware