qwen3-235b-a22b
2 workload results across 1 hardware configuration.
Fastest known config: 3.9 decode tok/s on M3 Pro (18-core GPU) + 36GB unified via hosted-api (full run in the table below).
M3 Pro (18-core GPU) + 36GB unified
| Workload | Backend | Quant | decode tok/s | prefill tok/s | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | hosted-api | — | 3.38 | 0.56 | 196,774 | r_4-jcze6xnqo |
| chat-short | hosted-api | — | 3.90 | 0.81 | 135,268 | r_8nhpq29n6gj |
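As a rough sanity check on how these columns relate, TTFT is approximately the prompt length divided by the prefill rate, plus any hosted-API queueing overhead. A minimal sketch of that arithmetic using the figures from run r_8nhpq29n6gj above; the simple model (and the implied prompt length it produces) is an assumption, not a measurement:

```python
# Back-of-the-envelope check: TTFT ≈ prompt_tokens / prefill_rate
# (ignores network and queueing overhead of a hosted API).
prefill_tok_s = 0.81     # prefill tok/s, run r_8nhpq29n6gj
ttft_ms = 135_268        # TTFT in ms, same run

implied_prompt_tokens = prefill_tok_s * (ttft_ms / 1000)
print(f"implied prompt length: ~{implied_prompt_tokens:.0f} tokens")  # ~110 tokens
```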
Community folklore
27 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.
- 10.00 tok/s — Qwen3-235B-A22B on RTX 3090 via llama.cpp (community confidence 70%)
  “ig fitting full 32k context in under 120 GB (V)RAM e.g. 24GB VRAM + 2x48GB DDR5 RAM. Just benchmarked over 140 tok/s prompt processing and 10 tok/s generation on my 3090TI FE + AMD 9950X 96GB RAM DDR5-6400 gaming rig (see comment for graph). Keep in mind this quant is \*not\* s…”
- 140.0 tok/s prompt processing — Qwen3-235B-A22B on RTX 3090 via llama.cpp (community confidence 70%)
  “and speed on a high end gaming rig fitting full 32k context in under 120 GB (V)RAM e.g. 24GB VRAM + 2x48GB DDR5 RAM. Just benchmarked over 140 tok/s prompt processing and 10 tok/s generation on my 3090TI FE + AMD 9950X 96GB RAM DDR5-6400 gaming rig (see comment for graph). Keep…”
- 15.00 tok/s — Qwen3-235B-A22B via llama.cpp q8_0 (community confidence 65%; the MoE-offload setup it describes is sketched after this list)
  “. By selectively offloading MoE tensors to the CPU - aiming to maximize the VRAM usage - I managed to run the model at generation rate of 15 tokens/s and a context window of 32k tokens. This token generation speed is really great for a non-reasoning model. Here is the full …”
- 6.80 tok/s — Qwen3-235B-A22B on RTX 5060 Ti, Q2_K (community confidence 65%)
  “n3 235b [Q2_K](https://huggingface.co/unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF/tree/main/Q2_K) on my 1500$ workstation and getting **6.5\~6.8 t/s** with 30K context - 115 token prompt - 1040 generated tokens Specs: 2x 16GB Nvidia (RTX 5060 Ti & P5000) + 64GB DDR5 6000Mhz + Int…”
- 11.10 tok/s — Qwen3 235B-A22B on Ryzen AI MAX+ via llama.cpp (community confidence 60%)
  “Qwen3 235B-A22B on a Windows tablet @ ~11.1t/s on AMD Ryzen AI Max 395+ 128GB RAM (Radeon 8060S iGPU-only inference, using 87.7GB out of 95.8GB total for 'VRAM') The fact you can run th”
- 55.00 tok/s — Qwen3-235B-A22B on Fireworks AI via hosted-api (community confidence 60%)
  “or those interested in local deployment and performance trade-offs: 1️⃣ Qwen3-235B-A22B (via Fireworks API) tops the table at 83.66% with ~55 tok/s. 2️⃣ But the 30B-A3B Unsloth quant delivered 82.20% while running locally at ~45 tok/s and with zero API spend. 3️⃣ The same Unsl…”
- 10.41 tok/s — Qwen3-235B-A22B on RTX 5090 via llama.cpp (community confidence 60%)
  “12,000 and 15,000 PLN Here are benchmarks of my system: Qwen2.5-72B-Instruct-Q6\_K - 9.14 t/s **Qwen3-235B-A22B-Q3\_K\_M** \- **10.41 t/s (maybe I should try Q4)** Llama-3.3-70B-Instruct-Q6\_K\_L - 11.03 t/s Qwen3-235B-A22B-Q2\_K - 14.77 t/s nvidia\_Llama-3\_3-N…”
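The 15 tok/s claim above hinges on keeping the attention and shared weights on the GPU while pushing the per-expert MoE tensors into system RAM. A minimal sketch of that setup, assuming a recent llama.cpp build where the --override-tensor/-ot flag is available; the model path, thread count, and tensor-name pattern below are illustrative, not taken from the original comment:

```python
# Sketch: launch llama.cpp's llama-server with MoE expert tensors forced to CPU RAM.
# Flag names assume a recent llama.cpp build; paths and values are placeholders.
import subprocess

cmd = [
    "./llama-server",
    "-m", "Qwen3-235B-A22B-Q8_0.gguf",   # placeholder model path
    "-c", "32768",                        # 32k context, as in the claim
    "-ngl", "99",                         # offload all layers to the GPU...
    "-ot", r"\.ffn_.*_exps\.=CPU",        # ...but keep per-expert FFN tensors in CPU RAM
    "--threads", "16",                    # tune to the host CPU
]
subprocess.run(cmd, check=True)
```

Some newer llama.cpp builds also expose a --cpu-moe convenience flag with the same effect; either way, only the dense, always-active weights occupy VRAM while the sparsely activated experts live in system memory.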