qwen3-coder
13 workload results across 1 hardware configuration.
Fastest local config
109.9 decode tok/s
on M4 Max (40-core GPU) + 128GB unified via ollama (Q4_K_M), run r_r0di2hkku1h
Local runs (13 runs)
Runs from contributors' own machines via MLX, llama.cpp, vLLM, exllamav2, or ollama. Signed on the submitter's hardware.
M4 Max (40-core GPU) + 128GB unified
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | ollama@0.30.11 | Q4_K_M | 94.51 | 65.06 | 1,691 | r_r0di2hkku1h |
| chat-long | ollama@0.30.11 | Q4_K_M | 97.35 | 1,397.3 | 2,252 | r_r0di2hkku1h |
| concurrent-decode | ollama@0.30.11 | Q4_K_M | 109.9 | no data | no data | r_r0di2hkku1h |
| agent-trace | ollama@0.30.11 | Q4_K_M | no data | no data | no data | r_r0di2hkku1h |
| chat-short | ollama@0.30.11 | Q4_K_M | 55.45 | 675.2 | 163 | r_1o8q4lhgj88 |
| chat-long | ollama@0.30.11 | Q4_K_M | 57.86 | 28,422.4 | 111 | r_1o8q4lhgj88 |
| concurrent-decode | ollama@0.30.11 | Q4_K_M | 67.01 | no data | no data | r_1o8q4lhgj88 |
| agent-trace | ollama@0.30.11 | Q4_K_M | no data | no data | no data | r_1o8q4lhgj88 |
| chat-short | ollama@0.30.11 | Q4_K_M | 92.70 | 719.0 | 153 | r_40h1dznuznk |
| chat-long | ollama@0.30.11 | Q4_K_M | 94.61 | 1,396.5 | 2,253 | r_40h1dznuznk |
| concurrent-decode | ollama@0.30.11 | Q4_K_M | 109.1 | no data | no data | r_40h1dznuznk |
| agent-trace | ollama@0.30.11 | Q4_K_M | no data | no data | no data | r_40h1dznuznk |
| chat-short | ollama@0.30.11 | Q4_K_M | 93.17 | 13.30 | 8,270 | r_txvkjgoyyxs |
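The headline "fastest local config" figure is the maximum of the decode column in the table above. A minimal sketch of that aggregation follows, alongside a per-workload median that is less sensitive to a single slow run. The names `ROWS`, `fastest`, and `median_by_workload` are illustrative and not part of any benchmark tooling; the values are copied from the table, with "no data" rows left out.

```python
from collections import defaultdict
from statistics import median

# (workload, run id, decode tok/s), copied from the table above.
# Rows whose decode column reads "no data" (agent-trace) are omitted.
ROWS = [
    ("chat-short", "r_r0di2hkku1h", 94.51),
    ("chat-long", "r_r0di2hkku1h", 97.35),
    ("concurrent-decode", "r_r0di2hkku1h", 109.9),
    ("chat-short", "r_1o8q4lhgj88", 55.45),
    ("chat-long", "r_1o8q4lhgj88", 57.86),
    ("concurrent-decode", "r_1o8q4lhgj88", 67.01),
    ("chat-short", "r_40h1dznuznk", 92.70),
    ("chat-long", "r_40h1dznuznk", 94.61),
    ("concurrent-decode", "r_40h1dznuznk", 109.1),
    ("chat-short", "r_txvkjgoyyxs", 93.17),
]


def fastest(rows):
    """Return the row with the highest decode throughput."""
    return max(rows, key=lambda row: row[2])


def median_by_workload(rows):
    """Group decode throughput by workload and take the median of each group."""
    groups = defaultdict(list)
    for workload, _run_id, decode_tok_s in rows:
        groups[workload].append(decode_tok_s)
    return {workload: median(values) for workload, values in groups.items()}


if __name__ == "__main__":
    workload, run_id, decode = fastest(ROWS)
    print(f"fastest: {decode} tok/s ({workload}, {run_id})")
    for workload, value in median_by_workload(ROWS).items():
        print(f"{workload}: median {value:.2f} tok/s")
```

Run r_1o8q4lhgj88 decodes at roughly 60% of the speed of the other runs on the same hardware, backend, and quant, which is why a median is shown here next to the maximum.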
Community folklore
383 unverified claims extracted from Reddit/HN comments. Lower trust than the signed runs above; every row links to the source.
- community confidence 85%
65.00 tok/s: Qwen3-Coder-Next on 4× RTX 3090 via vllm FP8 (at about 100k context)
“Next-FP8-Dynamic quant with vLLM v0.16.0 over 4 RTX 3090s with a 200k context window. I get around 71tps at low context, dropping to around 65tps at the 100k context point. Something to note, I'm seeing pretty bad issues with vLLM nightly, affecting this model especially, that c…”
- community confidence 85%
71.00 tok/s: Qwen3-Coder-Next on 4× RTX 3090 via vllm FP8 (at low context)
“sted. I'm running Unsloth's Qwen3-Coder-Next-FP8-Dynamic quant with vLLM v0.16.0 over 4 RTX 3090s with a 200k context window. I get around 71tps at low context, dropping to around 65tps at the 100k context point. Something to note, I'm seeing pretty bad issues with vLLM nightly…”
- community confidence 85%
150.0 tok/s: Qwen3-Coder on 2× RTX 3090 via vllm Q8 (FP16 KV cache)
“addition vRAM for a large enough context window for it to be useful for coding. For reference I’m able to run Qwen3-Coder-30B-A3B at Q8 at 150tok/s with a context window of 128k (FP16 KV) on dual 3090s. When using extensions like Cline you really need a large context window to b…”
- community confidence 85%
66.00 tok/s: qwen3-coder on RTX 5080 Mobile via llama.cpp Q4_K_M
“I get around ~66 t/s (16k/32k context, Q4_K_M) with very similar but Notebook hardware: AMD Ryzen 9955HX3D 64GB DDR5-5600 Nvidia RTX 5080 Mobile 16”
- community confidence 75%
30.00 tok/s: Qwen-3 Coder on RTX 5090 + RTX 3090 Ti via ollama FP16
“on: **FP16** * Hardware: **RTX 5090 + RTX 3090 Ti** * Task: code generation **Results:** * **llama.cpp:** \~52 tokens/sec * **Ollama:** \~30 tokens/sec Both runs use the same model weights and hardware. The gap is \~70% in favor of llama.cpp. Has anyone dug into why this happ…”
- community confidence 75%
52.00 tok/s: Qwen-3 Coder on RTX 5090 + RTX 3090 Ti via llama.cpp FP16
“**Qwen-3 Coder 32B** * Precision: **FP16** * Hardware: **RTX 5090 + RTX 3090 Ti** * Task: code generation **Results:** * **llama.cpp:** \~52 tokens/sec * **Ollama:** \~30 tokens/sec Both runs use the same model weights and hardware. The gap is \~70% in favor of llama.cpp. Has…”
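Several of the claims above quote relative differences rather than raw throughput. As a quick check of that arithmetic, the sketch below recomputes two of them from the quoted figures: the llama.cpp versus Ollama gap (52 vs 30 tok/s) and the slowdown from low context to about 100k context (71 vs 65 tok/s). The function names are illustrative, and the inputs are the commenters' own approximate numbers, not measured here.

```python
def relative_gain(fast: float, slow: float) -> float:
    """Fractional speedup of `fast` over `slow` (0.5 means 50% faster)."""
    return (fast - slow) / slow


def relative_drop(start: float, end: float) -> float:
    """Fractional loss going from `start` to `end` (0.1 means 10% slower)."""
    return (start - end) / start


if __name__ == "__main__":
    # llama.cpp (~52 tok/s) vs Ollama (~30 tok/s), as quoted in the claim.
    print(f"llama.cpp over Ollama: {relative_gain(52.0, 30.0):.1%}")  # 73.3%
    # vLLM FP8 on 4x RTX 3090: ~71 tok/s at low context, ~65 tok/s near 100k.
    print(f"drop at 100k context: {relative_drop(71.0, 65.0):.1%}")  # 8.5%
```

The first result, about 73%, is in line with the "~70%" stated in the quoted comment.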