qwen3-235b-a22b
2 workload results across 1 hardware configuration.
Fastest known config: 3.9 decode tok/s on M3 Pro (18-core GPU) + 36GB unified via hosted-api (full run in the table below).
M3 Pro (18-core GPU) + 36GB unified
| Workload | Backend | Quant | decode tok/s | prefill tok/s | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | hosted-api | — | 3.38 | 0.56 | 196,774 | r_4-jcze6xnqo |
| chat-short | hosted-api | — | 3.90 | 0.81 | 135,268 | r_8nhpq29n6gj |
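As a rough sanity check on how these columns relate, TTFT is approximately the prompt length divided by the prefill rate, plus any hosted-API queueing overhead. A minimal sketch of that arithmetic using the figures from run r_8nhpq29n6gj above; the simple model (and the implied prompt length it produces) is an assumption, not a measurement:

```python
# Back-of-the-envelope check: TTFT ≈ prompt_tokens / prefill_rate
# (ignores network and queueing overhead of a hosted API).
prefill_tok_s = 0.81     # prefill tok/s, run r_8nhpq29n6gj
ttft_ms = 135_268        # TTFT in ms, same run

implied_prompt_tokens = prefill_tok_s * (ttft_ms / 1000)
print(f"implied prompt length: ~{implied_prompt_tokens:.0f} tokens")  # ~110 tokens
```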
Community folklore
27 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.
- 10.00 tok/s — Qwen3-235B-A22B on RTX 3090 via llama.cpp (community confidence 70%)
  “ig fitting full 32k context in under 120 GB (V)RAM e.g. 24GB VRAM + 2x48GB DDR5 RAM. Just benchmarked over 140 tok/s prompt processing and 10 tok/s generation on my 3090TI FE + AMD 9950X 96GB RAM DDR5-6400 gaming rig (see comment for graph). Keep in mind this quant is \*not\* s…”
- 140.0 tok/s prompt processing — Qwen3-235B-A22B on RTX 3090 via llama.cpp (community confidence 70%)
  “and speed on a high end gaming rig fitting full 32k context in under 120 GB (V)RAM e.g. 24GB VRAM + 2x48GB DDR5 RAM. Just benchmarked over 140 tok/s prompt processing and 10 tok/s generation on my 3090TI FE + AMD 9950X 96GB RAM DDR5-6400 gaming rig (see comment for graph). Keep…”
- 15.00 tok/s — Qwen3-235B-A22B via llama.cpp q8_0 (community confidence 65%; the MoE-offload setup it describes is sketched after this list)
  “. By selectively offloading MoE tensors to the CPU - aiming to maximize the VRAM usage - I managed to run the model at generation rate of 15 tokens/s and a context window of 32k tokens. This token generation speed is really great for a non-reasoning model. Here is the full …”
- 6.80 tok/s — Qwen3-235B-A22B on RTX 5060 Ti, Q2_K (community confidence 65%)
  “n3 235b [Q2_K](https://huggingface.co/unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF/tree/main/Q2_K) on my 1500$ workstation and getting **6.5\~6.8 t/s** with 30K context - 115 token prompt - 1040 generated tokens Specs: 2x 16GB Nvidia (RTX 5060 Ti & P5000) + 64GB DDR5 6000Mhz + Int…”
- 11.10 tok/s — Qwen3 235B-A22B on Ryzen AI MAX+ via llama.cpp (community confidence 60%)
  “Qwen3 235B-A22B on a Windows tablet @ ~11.1t/s on AMD Ryzen AI Max 395+ 128GB RAM (Radeon 8060S iGPU-only inference, using 87.7GB out of 95.8GB total for 'VRAM') The fact you can run th”
- 55.00 tok/s — Qwen3-235B-A22B on Fireworks AI via hosted-api (community confidence 60%)
  “or those interested in local deployment and performance trade-offs: 1️⃣ Qwen3-235B-A22B (via Fireworks API) tops the table at 83.66% with ~55 tok/s. 2️⃣ But the 30B-A3B Unsloth quant delivered 82.20% while running locally at ~45 tok/s and with zero API spend. 3️⃣ The same Unsl…”
- 10.41 tok/s — Qwen3-235B-A22B on RTX 5090 via llama.cpp (community confidence 60%)
  “12,000 and 15,000 PLN Here are benchmarks of my system: Qwen2.5-72B-Instruct-Q6\_K - 9.14 t/s **Qwen3-235B-A22B-Q3\_K\_M** \- **10.41 t/s (maybe I should try Q4)** Llama-3.3-70B-Instruct-Q6\_K\_L - 11.03 t/s Qwen3-235B-A22B-Q2\_K - 14.77 t/s nvidia\_Llama-3\_3-N…”
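The 15 tok/s claim above hinges on keeping the attention and shared weights on the GPU while pushing the per-expert MoE tensors into system RAM. A minimal sketch of that setup, assuming a recent llama.cpp build where the --override-tensor/-ot flag is available; the model path, thread count, and tensor-name pattern below are illustrative, not taken from the original comment:

```python
# Sketch: launch llama.cpp's llama-server with MoE expert tensors forced to CPU RAM.
# Flag names assume a recent llama.cpp build; paths and values are placeholders.
import subprocess

cmd = [
    "./llama-server",
    "-m", "Qwen3-235B-A22B-Q8_0.gguf",   # placeholder model path
    "-c", "32768",                        # 32k context, as in the claim
    "-ngl", "99",                         # offload all layers to the GPU...
    "-ot", r"\.ffn_.*_exps\.=CPU",        # ...but keep per-expert FFN tensors in CPU RAM
    "--threads", "16",                    # tune to the host CPU
]
subprocess.run(cmd, check=True)
```

Some newer llama.cpp builds also expose a --cpu-moe convenience flag with the same effect; either way, only the dense, always-active weights occupy VRAM while the sparsely activated experts live in system memory.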