llm-speed

Fastest rig for Qwen2.5-Coder-32B (local)

Qwen2.5-Coder-32B is the strongest fully-open dense coding model that fits on a single prosumer card at 4-bit. Below are the fastest decode speeds (tok/s) submitted for each GPU and Apple Silicon tier.

Verdict

As of today, the fastest submitted Qwen2.5-Coder-32B decode is on the RTX 5090 (32GB) at 71.91 tok/s. The runner-up is the M3 Ultra (60-core GPU) at 34.48 tok/s, which makes the leader about 109% faster (roughly 2.1x). The table below ranks every rig we have a real run for so you can match the buy to your budget.
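The gap figure can be checked from the two submitted numbers. A minimal sketch, assuming the percentage is the leader's speed relative to the runner-up's; the function name `percent_faster` is illustrative and not part of the site's tooling:

```python
def percent_faster(leader_tps: float, runner_up_tps: float) -> float:
    """How much faster the leader is than the runner-up, in percent."""
    return (leader_tps / runner_up_tps - 1.0) * 100.0


# Decode speeds taken from the submitted runs above (tok/s).
rtx_5090_tps = 71.91
m3_ultra_tps = 34.48

ratio = rtx_5090_tps / m3_ultra_tps
gap = percent_faster(rtx_5090_tps, m3_ultra_tps)
print(f"{ratio:.2f}x, {gap:.0f}% faster")
```

Measured the other way round, the M3 Ultra runs at a little under half the leader's speed, so it helps to state which rig is the baseline when quoting a gap.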

Recommendation

Fastest submitted run for Qwen2.5-Coder-32B: RTX 5090 (32GB) at 71.91 tok/s.

Qwen2.5-Coder-32B at Q4_K_M needs ~20 GB for weights plus room for the KV cache, which puts it on a single 24-32 GB GPU (RTX 5090 / 4090 / 3090), a two-card split, or any Apple Silicon Max or Ultra with enough unified memory. Once the model fits, decode speed is set mostly by memory bandwidth, so the ranking below is the honest picture: every rig we have a real submitted run for, fastest first, each row linking the signed run. If your config isn't here yet, run the suite and submit; the row appears at the next refresh.
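The sizing and bandwidth reasoning above can be sketched as a back-of-envelope calculation. Everything beyond the ~20 GB weight figure is an assumption for illustration: the architecture defaults (64 layers, 8 KV heads, head dimension 128, fp16 cache), the 8192-token context, and the peak bandwidth figures (1792 GB/s for the RTX 5090, 819 GB/s for the M3 Ultra) are not taken from the submitted runs. The estimate also ignores compute buffers and runtime overhead, so treat the results as rough bounds, not predictions.

```python
GB = 1e9  # decimal gigabytes, matching how VRAM and bandwidth are usually quoted


def kv_cache_bytes(context_tokens: int, layers: int = 64, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """KV cache size: one K and one V vector per layer, per KV head, per token.

    The default architecture values are assumptions, not read from a model file.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens


def fits(vram_gb: float, weights_gb: float = 20.0, context_tokens: int = 8192) -> bool:
    """Rough check: do the weights plus KV cache fit in the given memory?"""
    needed = weights_gb * GB + kv_cache_bytes(context_tokens)
    return needed <= vram_gb * GB


def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float = 20.0) -> float:
    """Upper bound on decode speed if every token must stream all weights once."""
    return bandwidth_gb_s / weights_gb


print(f"KV cache at 8192 tokens: {kv_cache_bytes(8192) / GB:.2f} GB")
print(f"Fits in 24 GB: {fits(24)}, fits in 16 GB: {fits(16)}")

# Assumed peak memory bandwidths in GB/s (hypothetical inputs for this sketch).
for name, bandwidth in [("RTX 5090", 1792.0), ("M3 Ultra", 819.0)]:
    print(f"{name}: ceiling ~{decode_ceiling_tok_s(bandwidth):.1f} tok/s")
```

Under these assumptions both submitted results (71.91 and 34.48 tok/s) sit below their respective ceilings, which is consistent with decode being bandwidth-bound once the model fits.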

Submitted benchmarks

Hardware                  Decode (tok/s)  Workload    Run
RTX 5090 (32GB)           71.91           chat-short  r_nkbs6d3-d21
M3 Ultra (60-core GPU)    34.48           chat-short  r_721b4bls_oq

See also: All hardware · All models · Methodology