llama3.1
8 workload results across 2 hardware configurations.
Fastest local config: 154.4 decode tok/s on RTX 4090 (48GB) + AMD EPYC 7763 64-Core Processor (128c) + 1008GB via ollama (Q4_K_M).
Local runs (8 runs)
Runs come from contributors' own machines via MLX, llama.cpp, vLLM, exllamav2, or ollama. Each run is signed on the submitter's hardware.
RTX 4090 (48GB) + AMD EPYC 7763 64-Core Processor (128c) + 1008GB
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | ollama@0.31.1 | Q4_K_M | 154.4 | 328.0 | 335 | r_h1ub_1uxzdh |
| chat-long | ollama@0.31.1 | Q4_K_M | 146.3 | 5,032.5 | 623 | r_h1ub_1uxzdh |
| concurrent-decode | ollama@0.31.1 | Q4_K_M | 154.2 | no data | no data | r_h1ub_1uxzdh |
| agent-trace | ollama@0.31.1 | Q4_K_M | 149.4 | 5,377.5 | 387 | r_h1ub_1uxzdh |
RTX 3090 (24GB) + AMD EPYC 7663 56-Core Processor (56c) + 252GB
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | ollama@0.31.1 | Q4_K_M | 136.2 | 2.20 | 49,960 | r_9hurqggbshk |
| chat-long | ollama@0.31.1 | Q4_K_M | 127.5 | 3,004.3 | 1,044 | r_9hurqggbshk |
| concurrent-decode | ollama@0.31.1 | Q4_K_M | 133.9 | no data | no data | r_9hurqggbshk |
| agent-trace | ollama@0.31.1 | Q4_K_M | 127.5 | 3,475.9 | 527 | r_9hurqggbshk |
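
To compare the two configurations side by side, the decode figures from the tables above can be turned into per-workload ratios. The following is a minimal sketch: the dictionary `decode_tok_s`, the helper `speedup`, and the short labels "RTX 4090" and "RTX 3090" are illustrative names chosen here, not part of the benchmark site or of ollama.

```python
# Decode throughput in tok/s, copied from the two tables above.
# The short keys "RTX 4090" / "RTX 3090" are shorthand for the full
# hardware configurations (an assumption made for readability).
decode_tok_s = {
    "RTX 4090": {
        "chat-short": 154.4,
        "chat-long": 146.3,
        "concurrent-decode": 154.2,
        "agent-trace": 149.4,
    },
    "RTX 3090": {
        "chat-short": 136.2,
        "chat-long": 127.5,
        "concurrent-decode": 133.9,
        "agent-trace": 127.5,
    },
}


def speedup(workload, fast="RTX 4090", slow="RTX 3090"):
    """Return the decode-throughput ratio fast/slow for one workload."""
    return decode_tok_s[fast][workload] / decode_tok_s[slow][workload]


# Print one ratio per workload, e.g. "chat-short: 1.13x".
for workload in decode_tok_s["RTX 4090"]:
    print(f"{workload}: {speedup(workload):.2f}x")
```

On these numbers the RTX 4090 configuration decodes roughly 1.13x to 1.17x faster than the RTX 3090 configuration, depending on the workload. The ratio covers decode throughput only; prefill and TTFT are not included because the concurrent-decode rows report no data for them.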