llm-speed
qwen3-coder

13 workload results across 1 hardware configuration.

Fastest local config

109.9 decode tok/s

on M4 Max (40-core GPU) + 128GB unified via ollama (Q4_K_M), concurrent-decode workload, run r_r0di2hkku1h

Local runs (13 runs)

Runs from contributors' own machines via MLX, llama.cpp, vLLM, exllamav2, or ollama. Signed on the submitter's hardware.

M4 Max (40-core GPU) + 128GB unified

| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | ollama@0.30.11 | Q4_K_M | 94.51 | 65.06 | 1,691 | r_r0di2hkku1h |
| chat-long | ollama@0.30.11 | Q4_K_M | 97.35 | 1,397.3 | 2,252 | r_r0di2hkku1h |
| concurrent-decode | ollama@0.30.11 | Q4_K_M | 109.9 | no data | no data | r_r0di2hkku1h |
| agent-trace | ollama@0.30.11 | Q4_K_M | no data | no data | no data | r_r0di2hkku1h |
| chat-short | ollama@0.30.11 | Q4_K_M | 55.45 | 675.2 | 163 | r_1o8q4lhgj88 |
| chat-long | ollama@0.30.11 | Q4_K_M | 57.86 | 28,422.4 | 111 | r_1o8q4lhgj88 |
| concurrent-decode | ollama@0.30.11 | Q4_K_M | 67.01 | no data | no data | r_1o8q4lhgj88 |
| agent-trace | ollama@0.30.11 | Q4_K_M | no data | no data | no data | r_1o8q4lhgj88 |
| chat-short | ollama@0.30.11 | Q4_K_M | 92.70 | 719.0 | 153 | r_40h1dznuznk |
| chat-long | ollama@0.30.11 | Q4_K_M | 94.61 | 1,396.5 | 2,253 | r_40h1dznuznk |
| concurrent-decode | ollama@0.30.11 | Q4_K_M | 109.1 | no data | no data | r_40h1dznuznk |
| agent-trace | ollama@0.30.11 | Q4_K_M | no data | no data | no data | r_40h1dznuznk |
| chat-short | ollama@0.30.11 | Q4_K_M | 93.17 | 13.30 | 8,270 | r_txvkjgoyyxs |
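
The prefill and TTFT columns can be cross-checked against each other. The sketch below assumes that TTFT is roughly the time spent on prefill, so that prompt length is approximately prefill rate times TTFT. The page does not state how either metric is measured, and `implied_prompt_tokens` is a hypothetical helper name, so treat this as a sanity check rather than the leaderboard's own method.

```python
# Hypothetical helper: assumes TTFT is roughly the prefill time, so
# prompt tokens ~= prefill rate * TTFT. The leaderboard does not confirm this.
def implied_prompt_tokens(prefill_tok_s: float, ttft_ms: float) -> float:
    return prefill_tok_s * ttft_ms / 1000.0


# (run id, prefill tok/s, TTFT in ms) copied from the chat-short rows above
chat_short = [
    ("r_r0di2hkku1h", 65.06, 1691),
    ("r_1o8q4lhgj88", 675.2, 163),
    ("r_40h1dznuznk", 719.0, 153),
    ("r_txvkjgoyyxs", 13.30, 8270),
]

for run, prefill, ttft in chat_short:
    tokens = implied_prompt_tokens(prefill, ttft)
    print(f"{run}: ~{tokens:.0f} prompt tokens")
```

Under that assumption, all four chat-short rows imply a prompt of about 110 tokens, so the wide spread in prefill tok/s tracks the spread in TTFT rather than a change in prompt size. What causes the TTFT differences between runs is not stated on the page.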

Community folklore

383 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.

  • community · confidence 85%

    65.00 tok/s Qwen3-Coder-Next on 4× RTX 3090 via vllm FP8 (at around 100k context)

    Next-FP8-Dynamic quant with vLLM v0.16.0 over 4 RTX 3090s with a 200k context window. I get around 71tps at low context, dropping to around 65tps at the 100k context point. Something to note, I'm seeing pretty bad issues with vLLM nightly, affecting this model especially, that c…

    source: Reddit · u/rmhubbert · 2026-03-04

  • community · confidence 85%

    71.00 tok/s Qwen3-Coder-Next on 4× RTX 3090 via vllm FP8 (at low context)

    sted. I'm running Unsloth's Qwen3-Coder-Next-FP8-Dynamic quant with vLLM v0.16.0 over 4 RTX 3090s with a 200k context window. I get around 71tps at low context, dropping to around 65tps at the 100k context point. Something to note, I'm seeing pretty bad issues with vLLM nightly…

    source: Reddit · u/rmhubbert · 2026-03-04

  • community · confidence 85%

    150.0 tok/s Qwen3-Coder on 2× RTX 3090 via vllm Q8 (FP16 KV cache)

    addition vRAM for a large enough context window for it to be useful for coding. For reference I’m able to run Qwen3-Coder-30B-A3B at Q8 at 150tok/s with a context window of 128k (FP16 KV) on dual 3090s. When using extensions like Cline you really need a large context window to b…

    source: Reddit · u/Tyme4Trouble · 2025-08-05

  • community · confidence 85%

    66.00 tok/s qwen3-coder on RTX 5080 via llama.cpp Q4_K_M

    I get around ~66 t/s (16k/32k context, Q4_K_M) with very similar but Notebook hardware: AMD Ryzen 9955HX3D 64GB DDR5-5600 Nvidia RTX 5080 Mobile 16

    source: Reddit · u/Danmoreng · 2026-02-25

  • community · confidence 75%

    30.00 tok/s Qwen-3 Coder on RTX 5090 + RTX 3090 Ti via ollama FP16

    on: FP16 * Hardware: RTX 5090 + RTX 3090 Ti * Task: code generation Results: * llama.cpp: ~52 tokens/sec * Ollama: ~30 tokens/sec Both runs use the same model weights and hardware. The gap is ~70% in favor of llama.cpp. Has anyone dug into why this happ…

    source: Reddit · u/Shoddy_Bed3240 · 2026-01-07

  • community · confidence 75%

    52.00 tok/s Qwen-3 Coder on RTX 5090 + RTX 3090 Ti via llama.cpp FP16

    Qwen-3 Coder 32B * Precision: FP16 * Hardware: RTX 5090 + RTX 3090 Ti * Task: code generation Results: * llama.cpp: ~52 tokens/sec * Ollama: ~30 tokens/sec Both runs use the same model weights and hardware. The gap is ~70% in favor of llama.cpp. Has…

    source: Reddit · u/Shoddy_Bed3240 · 2026-01-07
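
The "~70%" gap quoted in the last two claims depends on which backend is taken as the baseline. A minimal sketch of the arithmetic, using only the two approximate figures from the quoted comment (`relative_gap_pct` is a hypothetical helper name):

```python
# Hypothetical helper: expresses the gap between two throughput figures
# relative to each baseline in turn.
def relative_gap_pct(fast: float, slow: float) -> tuple[float, float]:
    faster_by = (fast / slow - 1.0) * 100.0  # gain relative to the slower backend
    slower_by = (1.0 - slow / fast) * 100.0  # loss relative to the faster backend
    return faster_by, slower_by


# ~52 tok/s (llama.cpp) vs ~30 tok/s (Ollama), as quoted in the comment
faster_by, slower_by = relative_gap_pct(52.0, 30.0)
print(f"llama.cpp is {faster_by:.0f}% faster; Ollama is {slower_by:.0f}% slower")
```

With Ollama as the baseline, the gap comes to about 73%, which is consistent with the commenter's rounded "~70%". With llama.cpp as the baseline, the same two numbers give roughly 42%.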

qwen3-coder on hardware