llm-speed

r1

16 workload results across 1 hardware configuration.

Fastest local config

133.8 decode tok/s

on RTX 4090 (24GB) + AMD EPYC 75F3 32-Core Processor (64c) + 504GB via ollama (Q4_K_M)

Local runs (16 runs)

Runs from contributors' own machines via MLX, llama.cpp, vLLM, exllamav2, or ollama. Signed on the submitter's hardware.

RTX 4090 (24GB) + AMD EPYC 75F3 32-Core Processor (64c) + 504GB

Workload           Backend        Quant   decode tok/s  prefill tok/s  TTFT (ms)  Run
chat-short         ollama@0.31.1  Q4_K_M  no data       no data        no data    r_zrjj-pj93q8
chat-long          ollama@0.31.1  Q4_K_M  81.09         240.4          13,070     r_zrjj-pj93q8
concurrent-decode  ollama@0.31.1  Q4_K_M  no data       no data        no data    r_zrjj-pj93q8
agent-trace        ollama@0.31.1  Q4_K_M  no data       no data        no data    r_zrjj-pj93q8
chat-short         ollama@0.31.1  Q4_K_M  no data       no data        no data    r_fg77v2hhohb
chat-long          ollama@0.31.1  Q4_K_M  131.9         686.3          4,578      r_fg77v2hhohb
concurrent-decode  ollama@0.31.1  Q4_K_M  no data       no data        no data    r_fg77v2hhohb
agent-trace        ollama@0.31.1  Q4_K_M  133.8         8,852.4        366        r_fg77v2hhohb
chat-short         ollama@0.31.1  Q4_K_M  no data       no data        no data    r_l2ck67ls40i
chat-long          ollama@0.31.1  Q4_K_M  3.76          14.28          219,986    r_l2ck67ls40i
concurrent-decode  ollama@0.31.1  Q4_K_M  no data       no data        no data    r_l2ck67ls40i
agent-trace        ollama@0.31.1  Q4_K_M  3.79          3,515.1        971        r_l2ck67ls40i
chat-short         ollama@0.31.1  Q4_K_M  no data       no data        no data    r_2r1p3w9ps7c
chat-long          ollama@0.31.1  Q4_K_M  131.0         599.9          5,238      r_2r1p3w9ps7c
concurrent-decode  ollama@0.31.1  Q4_K_M  no data       no data        no data    r_2r1p3w9ps7c
agent-trace        ollama@0.31.1  Q4_K_M  132.4         9,612.8        337        r_2r1p3w9ps7c
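
The numbers in the table can be cross-checked with a little arithmetic. The sketch below is not the site's own code. It copies the rows that report data, picks the highest decode rate (which matches the "Fastest local config" headline), and estimates the prompt length implied by each row. That estimate assumes TTFT is dominated by prefill, so that prompt tokens are roughly prefill rate times TTFT; the page does not state this relationship. The names `ROWS`, `fastest_decode`, and `implied_prompt_tokens` are made up for illustration.

```python
# Rows copied from the table above, skipping entries reported as "no data".
# Fields: (workload, run id, decode tok/s, prefill tok/s, TTFT in ms).
ROWS = [
    ("chat-long", "r_zrjj-pj93q8", 81.09, 240.4, 13070),
    ("chat-long", "r_fg77v2hhohb", 131.9, 686.3, 4578),
    ("agent-trace", "r_fg77v2hhohb", 133.8, 8852.4, 366),
    ("chat-long", "r_l2ck67ls40i", 3.76, 14.28, 219986),
    ("agent-trace", "r_l2ck67ls40i", 3.79, 3515.1, 971),
    ("chat-long", "r_2r1p3w9ps7c", 131.0, 599.9, 5238),
    ("agent-trace", "r_2r1p3w9ps7c", 132.4, 9612.8, 337),
]


def fastest_decode(rows):
    """Return the row with the highest decode throughput."""
    return max(rows, key=lambda row: row[2])


def implied_prompt_tokens(prefill_tok_s, ttft_ms):
    """Estimate prompt length, ASSUMING TTFT is spent almost entirely on prefill."""
    return prefill_tok_s * ttft_ms / 1000.0


if __name__ == "__main__":
    workload, run_id, decode, _, _ = fastest_decode(ROWS)
    print(f"fastest: {decode} tok/s ({workload}, {run_id})")
    for workload, run_id, _, prefill, ttft in ROWS:
        tokens = implied_prompt_tokens(prefill, ttft)
        print(f"{workload:12s} {run_id:14s} ~{tokens:.0f} prompt tokens")
```

Under that assumption, all four chat-long rows imply a prompt of roughly 3,140 tokens. This suggests run r_l2ck67ls40i processed the same prompt as the others, only far more slowly.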

r1 on hardware