llm-speed

qwen3-235b-a22b

2 workload results across 1 hardware configuration.

Fastest known config

3.9 decode tok/s

on M3 Pro (18-core GPU) + 36GB unified via hosted-api

M3 Pro (18-core GPU) + 36GB unified

Workload     Backend      Quant   decode tok/s   prefill tok/s   TTFT          Run
chat-short   hosted-api   -       3.38           0.56            196,774 ms    r_4-jcze6xnqo
chat-short   hosted-api   -       3.90           0.81            135,268 ms    r_8nhpq29n6gj
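The three timing columns are connected by simple ratios. As a sketch, assuming TTFT is dominated by prefill and decode throughput is measured after the first token (the page does not document its exact methodology, so these definitions are an assumption):

```python
# Hedged sketch of the assumed relationship between the table's columns:
#   prefill tok/s ~= prompt tokens / TTFT
#   decode tok/s  ~= output tokens / (total wall time - TTFT)
# These definitions are assumptions, not this leaderboard's documented method.

def throughput(prompt_tokens: int, ttft_s: float,
               output_tokens: int, total_s: float) -> tuple[float, float]:
    prefill_tok_s = prompt_tokens / ttft_s
    decode_tok_s = output_tokens / (total_s - ttft_s)
    return prefill_tok_s, decode_tok_s
```

Read the other way, run r_8nhpq29n6gj's 0.81 prefill tok/s over a 135,268 ms TTFT would imply a prompt of roughly 110 tokens, under the same assumptions.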

Community folklore

27 unverified claims extracted from Reddit/HN comments. These carry lower trust than the signed runs above; every row links to its source.

  • community · confidence 70%

    10.00 tok/s Qwen3-235B-A22B on RTX 3090 via llama.cpp

    ig fitting full 32k context in under 120 GB (V)RAM e.g. 24GB VRAM + 2x48GB DDR5 RAM. Just benchmarked over 140 tok/s prompt processing and 10 tok/s generation on my 3090TI FE + AMD 9950X 96GB RAM DDR5-6400 gaming rig (see comment for graph). Keep in mind this quant is *not* s…

    source: Reddit · u/VoidAlchemy · 2025-04-30

  • community · confidence 70%

    140.0 tok/s Qwen3-235B-A22B on RTX 3090 via llama.cpp

    and speed on a high end gaming rig fitting full 32k context in under 120 GB (V)RAM e.g. 24GB VRAM + 2x48GB DDR5 RAM. Just benchmarked over 140 tok/s prompt processing and 10 tok/s generation on my 3090TI FE + AMD 9950X 96GB RAM DDR5-6400 gaming rig (see comment for graph). Keep…

    source: Reddit · u/VoidAlchemy · 2025-04-30

  • community · confidence 65%

    15.00 tok/s Qwen3-235B-A22B via llama.cpp q8_0

    . By selectively offloading MoE tensors to the CPU - aiming to maximize the VRAM usage - I managed to run the model at generation rate of 15 tokens/s and a context window of 32k tokens. This token generation speed is really great for a non-reasoning model. Here is the full …

    source: Reddit · u/FalseMap1582 · 2025-07-24

  • community · confidence 65%

    6.80 tok/s Qwen3-235B-A22B on RTX 5060 Ti Q2_K

    n3 235b [Q2_K](https://huggingface.co/unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF/tree/main/Q2_K) on my 1500$ workstation and getting **6.5~6.8 t/s** with 30K context - 115 token prompt - 1040 generated tokens Specs: 2x 16GB Nvidia (RTX 5060 Ti & P5000) + 64GB DDR5 6000Mhz + Int…

    source: Reddit · u/AdamDhahabi · 2025-07-23

  • community · confidence 60%

    11.10 tok/s Qwen3 235B-A22B on Ryzen AI MAX+ via llama.cpp

    Qwen3 235B-A22B on a Windows tablet @ ~11.1t/s on AMD Ryzen AI Max 395+ 128GB RAM (Radeon 8060S iGPU-only inference, using 87.7GB out of 95.8GB total for 'VRAM') The fact you can run th

    source: Reddit · u/Invuska · 2025-05-02

  • community · confidence 60%

    55.00 tok/s Qwen3-235B-A22B on Fireworks AI via hosted-api

    or those interested in local deployment and performance trade-offs: 1️⃣ Qwen3-235B-A22B (via Fireworks API) tops the table at 83.66% with ~55 tok/s. 2️⃣ But the 30B-A3B Unsloth quant delivered 82.20% while running locally at ~45 tok/s and with zero API spend. 3️⃣ The same Unsl…

    source: Reddit · u/ResearchCrafty1804 · 2025-05-07

  • community · confidence 60%

    10.41 tok/s Qwen3-235B-A22B on RTX 5090 via llama.cpp

    12,000 and 15,000 PLN Here are benchmarks of my system: Qwen2.5-72B-Instruct-Q6_K - 9.14 t/s **Qwen3-235B-A22B-Q3_K_M** - **10.41 t/s (maybe I should try Q4)** Llama-3.3-70B-Instruct-Q6_K_L - 11.03 t/s Qwen3-235B-A22B-Q2_K - 14.77 t/s nvidia_Llama-3_3-N…

    source: Reddit · u/jacek2023 · 2025-05-17
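The folklore rows above are produced by pulling throughput figures out of free-text comments. The site's actual extractor is not public; as an illustrative sketch, a normalizer for mentions like "10 tok/s", "~11.1t/s", or "14.77 t/s" might look like this:

```python
import re

# Hypothetical extractor for throughput mentions in comment text.
# Matches variants such as "140 tok/s", "~11.1t/s", "15 tokens/s".
# The leaderboard's real extraction pipeline is not public; this regex
# is an assumption for illustration only.
RATE = re.compile(r"~?(\d+(?:\.\d+)?)\s*t(?:ok)?(?:ens)?/s", re.IGNORECASE)

def extract_rates(text: str) -> list[float]:
    """Return every throughput figure found in the text, in order."""
    return [float(m) for m in RATE.findall(text)]
```

Against the first claim's snippet, this would pull out both the prompt-processing and the generation figure, which is presumably why that comment yields two separate rows (140 tok/s and 10 tok/s).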

See all 27 claims for qwen3-235b-a22b
