qwen3-coder

13 workload results across 1 hardware configuration.

Fastest local config

109.9 decode tok/s

on M4 Max (40-core GPU) + 128GB unified via ollama (Q4_K_M)

Local runs (13 runs)

Runs come from contributors' own machines via MLX, llama.cpp, vLLM, exllamav2, or ollama, and each run is signed on the submitter's hardware.

M4 Max (40-core GPU) + 128GB unified

Workload	Backend	Quant	Decode (tok/s)	Prefill (tok/s)	TTFT (ms)	Run
chat-short	ollama@0.30.11	Q4_K_M	94.51	65.06	1,691	r_r0di2hkku1h
chat-long	ollama@0.30.11	Q4_K_M	97.35	1,397.3	2,252	r_r0di2hkku1h
concurrent-decode	ollama@0.30.11	Q4_K_M	109.9	no data	no data	r_r0di2hkku1h
agent-trace	ollama@0.30.11	Q4_K_M	no data	no data	no data	r_r0di2hkku1h
chat-short	ollama@0.30.11	Q4_K_M	55.45	675.2	163	r_1o8q4lhgj88
chat-long	ollama@0.30.11	Q4_K_M	57.86	28,422.4	111	r_1o8q4lhgj88
concurrent-decode	ollama@0.30.11	Q4_K_M	67.01	no data	no data	r_1o8q4lhgj88
agent-trace	ollama@0.30.11	Q4_K_M	no data	no data	no data	r_1o8q4lhgj88
chat-short	ollama@0.30.11	Q4_K_M	92.70	719.0	153	r_40h1dznuznk
chat-long	ollama@0.30.11	Q4_K_M	94.61	1,396.5	2,253	r_40h1dznuznk
concurrent-decode	ollama@0.30.11	Q4_K_M	109.1	no data	no data	r_40h1dznuznk
agent-trace	ollama@0.30.11	Q4_K_M	no data	no data	no data	r_40h1dznuznk
chat-short	ollama@0.30.11	Q4_K_M	93.17	13.30	8,270	r_txvkjgoyyxs
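
The "Fastest local config" figure at the top of the page can be cross-checked against the table. The following is a minimal Python sketch, not part of any benchmark tooling: the tuple layout and the `fastest_decode` helper are hypothetical, and the decode values are copied by hand from the 13 rows above, with "no data" represented as `None`.

```python
# Each tuple: (workload, decode tok/s or None for "no data", run id).
# Values are transcribed from the table; the layout is an illustrative assumption.
ROWS = [
    ("chat-short", 94.51, "r_r0di2hkku1h"),
    ("chat-long", 97.35, "r_r0di2hkku1h"),
    ("concurrent-decode", 109.9, "r_r0di2hkku1h"),
    ("agent-trace", None, "r_r0di2hkku1h"),
    ("chat-short", 55.45, "r_1o8q4lhgj88"),
    ("chat-long", 57.86, "r_1o8q4lhgj88"),
    ("concurrent-decode", 67.01, "r_1o8q4lhgj88"),
    ("agent-trace", None, "r_1o8q4lhgj88"),
    ("chat-short", 92.70, "r_40h1dznuznk"),
    ("chat-long", 94.61, "r_40h1dznuznk"),
    ("concurrent-decode", 109.1, "r_40h1dznuznk"),
    ("agent-trace", None, "r_40h1dznuznk"),
    ("chat-short", 93.17, "r_txvkjgoyyxs"),
]


def fastest_decode(rows):
    """Return the row with the highest decode rate, skipping rows with no data."""
    measured = [row for row in rows if row[1] is not None]
    return max(measured, key=lambda row: row[1])


if __name__ == "__main__":
    workload, rate, run_id = fastest_decode(ROWS)
    print(f"{rate} tok/s on {workload} ({run_id})")
```

Rows with no decode figure are dropped before taking the maximum, so the agent-trace entries do not affect the result.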

Community folklore

383 unverified claims extracted from Reddit/HN comments. These carry lower trust than the signed runs above; every row links to its source.

community confidence 85%
65.00 tok/s: Qwen3-Coder-Next on 4× RTX 3090 via vllm FP8
“Next-FP8-Dynamic quant with vLLM v0.16.0 over 4 RTX 3090s with a 200k context window. I get around 71tps at low context, dropping to around 65tps at the 100k context point. Something to note, I'm seeing pretty bad issues with vLLM nightly, affecting this model especially, that c…”
source: Reddit · u/rmhubbert · 2026-03-04
community confidence 85%
71.00 tok/s: Qwen3-Coder-Next on 4× RTX 3090 via vllm FP8
“sted. I'm running Unsloth's Qwen3-Coder-Next-FP8-Dynamic quant with vLLM v0.16.0 over 4 RTX 3090s with a 200k context window. I get around 71tps at low context, dropping to around 65tps at the 100k context point. Something to note, I'm seeing pretty bad issues with vLLM nightly…”
source: Reddit · u/rmhubbert · 2026-03-04
community confidence 85%
150.0 tok/s: Qwen3-Coder-30B-A3B on 2× RTX 3090 via vllm, Q8 weights with FP16 KV cache
“addition vRAM for a large enough context window for it to be useful for coding. For reference I’m able to run Qwen3-Coder-30B-A3B at Q8 at 150tok/s with a context window of 128k (FP16 KV) on dual 3090s. When using extensions like Cline you really need a large context window to b…”
source: Reddit · u/Tyme4Trouble · 2025-08-05
community confidence 85%
66.00 tok/s: qwen3-coder on RTX 5080 Mobile via llama.cpp Q4_K_M
“I get around ~66 t/s (16k/32k context, Q4_K_M) with very similar but Notebook hardware: AMD Ryzen 9955HX3D 64GB DDR5-5600 Nvidia RTX 5080 Mobile 16”
source: Reddit · u/Danmoreng · 2026-02-25
community confidence 75%
30.00 tok/s: Qwen-3 Coder on RTX 5090 + RTX 3090 Ti via ollama FP16
“on: **FP16** * Hardware: **RTX 5090 + RTX 3090 Ti** * Task: code generation **Results:** * **llama.cpp:** \~52 tokens/sec * **Ollama:** \~30 tokens/sec Both runs use the same model weights and hardware. The gap is \~70% in favor of llama.cpp. Has anyone dug into why this happ…”
source: Reddit · u/Shoddy_Bed3240 · 2026-01-07
community confidence 75%
52.00 tok/s: Qwen-3 Coder on RTX 5090 + RTX 3090 Ti via llama.cpp FP16
“**Qwen-3 Coder 32B** * Precision: **FP16** * Hardware: **RTX 5090 + RTX 3090 Ti** * Task: code generation **Results:** * **llama.cpp:** \~52 tokens/sec * **Ollama:** \~30 tokens/sec Both runs use the same model weights and hardware. The gap is \~70% in favor of llama.cpp. Has…”
source: Reddit · u/Shoddy_Bed3240 · 2026-01-07
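
The "~70%" gap quoted in the last two claims can be checked with a line of arithmetic. This is a minimal sketch, assuming the gap is measured relative to the slower backend (the comment does not say which baseline it uses); `relative_gap` is a hypothetical helper name.

```python
def relative_gap(faster, slower):
    """Percentage by which `faster` exceeds `slower`, relative to `slower`."""
    return (faster - slower) / slower * 100.0


if __name__ == "__main__":
    # Figures as quoted in the claim: llama.cpp ~52 tok/s, Ollama ~30 tok/s.
    print(f"{relative_gap(52.0, 30.0):.1f}%")
```

Measured against the Ollama figure, the gap comes out at about 73%, which is roughly in line with the quoted "~70%". Measured against the llama.cpp figure instead, the same numbers give about 42%.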

