qwen2.5-coder
16 workload results across 3 hardware configurations.
Fastest local config: 161.1 decode tok/s on RTX 4090 (48GB) + AMD EPYC 7763 64-Core Processor (128c) + 1008GB via ollama (Q4_K_M).
Local runs (16 runs)
Runs from contributors' own machines via MLX, llama.cpp, vLLM, exllamav2, or ollama. Signed on the submitter's hardware.
RTX 4090 (48GB) + AMD EPYC 7763 64-Core Processor (128c) + 1008GB
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | ollama@0.31.1 | Q4_K_M | 89.45 | 466.6 | 281 | r_73tnfdueq2h |
| chat-long | ollama@0.31.1 | Q4_K_M | 83.84 | 3,572.3 | 887 | r_73tnfdueq2h |
| concurrent-decode | ollama@0.31.1 | Q4_K_M | 88.63 | no data | no data | r_73tnfdueq2h |
| agent-trace | ollama@0.31.1 | Q4_K_M | 86.21 | 5,298.4 | 395 | r_73tnfdueq2h |
| chat-short | ollama@0.31.1 | Q4_K_M | 159.4 | 452.0 | 290 | r_mv8n8k9wu1e |
| chat-long | ollama@0.31.1 | Q4_K_M | 154.7 | 5,998.7 | 528 | r_mv8n8k9wu1e |
| concurrent-decode | ollama@0.31.1 | Q4_K_M | 161.1 | no data | no data | r_mv8n8k9wu1e |
| agent-trace | ollama@0.31.1 | Q4_K_M | 158.1 | 6,421.1 | 320 | r_mv8n8k9wu1e |
RTX 3090 (24GB) + AMD EPYC 7702P 64-Core Processor (64c) + 252GB
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | ollama@0.31.1 | Q4_K_M | 69.21 | 402.0 | 326 | r_6ahy-dq0f_0 |
| chat-long | ollama@0.31.1 | Q4_K_M | 65.83 | 2,070.6 | 1,530 | r_6ahy-dq0f_0 |
| concurrent-decode | ollama@0.31.1 | Q4_K_M | 67.64 | no data | no data | r_6ahy-dq0f_0 |
| agent-trace | ollama@0.31.1 | Q4_K_M | 63.63 | 3,518.8 | 483 | r_6ahy-dq0f_0 |
RTX 3090 (24GB) + AMD EPYC 7663 56-Core Processor (56c) + 252GB
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | ollama@0.31.1 | Q4_K_M | 39.86 | 2.83 | 46,337 | r_pb0jpnbujji |
| chat-long | ollama@0.31.1 | Q4_K_M | 135.0 | 3,319.4 | 954 | r_pb0jpnbujji |
| concurrent-decode | ollama@0.31.1 | Q4_K_M | 139.2 | no data | no data | r_pb0jpnbujji |
| agent-trace | ollama@0.31.1 | Q4_K_M | 134.2 | 4,111.1 | 487 | r_pb0jpnbujji |
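The prefill and TTFT columns can be cross-checked against each other. The sketch below assumes that prefill tok/s is reported as prompt tokens divided by time to first token; the page does not define the metric, so treat that relation as an assumption. Under it, multiplying the two columns recovers the implied prompt length for each chat-short row. The names `CHAT_SHORT_ROWS`, `implied_prompt_tokens`, and `summarize` are illustrative only.

```python
# Rows copied from the chat-short lines of the tables above:
# (run id, prefill tok/s, TTFT in ms)
CHAT_SHORT_ROWS = [
    ("r_73tnfdueq2h", 466.6, 281),
    ("r_mv8n8k9wu1e", 452.0, 290),
    ("r_6ahy-dq0f_0", 402.0, 326),
    ("r_pb0jpnbujji", 2.83, 46337),
]


def implied_prompt_tokens(prefill_tok_s: float, ttft_ms: float) -> float:
    """Prompt tokens implied by the ASSUMED relation tokens = rate * time."""
    return prefill_tok_s * ttft_ms / 1000.0


def summarize(rows):
    """Map each run id to its implied prompt length, rounded to 0.1 token."""
    return {run: round(implied_prompt_tokens(p, t), 1) for run, p, t in rows}


if __name__ == "__main__":
    for run, tokens in summarize(CHAT_SHORT_ROWS).items():
        print(f"{run}: ~{tokens} prompt tokens")
```

If the assumption holds, all four chat-short rows imply a prompt of roughly 131 tokens, including the r_pb0jpnbujji row. That would suggest its 2.83 tok/s prefill figure follows from the 46,337 ms TTFT on the same prompt rather than from a different workload; the tables alone do not say what caused the delay.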