llm-speed

llama3.1

8 workload results across 2 hardware configurations.

Fastest local config

154.4 decode tok/s

on RTX 4090 (48GB) + AMD EPYC 7763 64-Core Processor (128c) + 1008GB via ollama (Q4_K_M)

Local runs (8 runs)

Runs from contributors' own machines via MLX, llama.cpp, vLLM, exllamav2, or ollama. Signed on the submitter's hardware.

RTX 4090 (48GB) + AMD EPYC 7763 64-Core Processor (128c) + 1008GB

| Workload | Backend | Quant | decode tok/s | prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | ollama@0.31.1 | Q4_K_M | 154.4 | 328.0 | 335 ms | r_h1ub_1uxzdh |
| chat-long | ollama@0.31.1 | Q4_K_M | 146.3 | 5,032.5 | 623 ms | r_h1ub_1uxzdh |
| concurrent-decode | ollama@0.31.1 | Q4_K_M | 154.2 | no data | no data | r_h1ub_1uxzdh |
| agent-trace | ollama@0.31.1 | Q4_K_M | 149.4 | 5,377.5 | 387 ms | r_h1ub_1uxzdh |

RTX 3090 (24GB) + AMD EPYC 7663 56-Core Processor (56c) + 252GB

| Workload | Backend | Quant | decode tok/s | prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | ollama@0.31.1 | Q4_K_M | 136.2 | 2.20 | 49,960 ms | r_9hurqggbshk |
| chat-long | ollama@0.31.1 | Q4_K_M | 127.5 | 3,004.3 | 1,044 ms | r_9hurqggbshk |
| concurrent-decode | ollama@0.31.1 | Q4_K_M | 133.9 | no data | no data | r_9hurqggbshk |
| agent-trace | ollama@0.31.1 | Q4_K_M | 127.5 | 3,475.9 | 527 ms | r_9hurqggbshk |
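
To compare the two hardware configurations workload by workload, the decode figures from the two tables above can be turned into a simple ratio. The following is a minimal sketch, not part of the site: the `decode` dictionary and the `speedup` helper are hypothetical names introduced for illustration, and the numbers are copied by hand from the tables.

```python
# Decode throughput in tok/s, copied from the two tables above.
# Each tuple is (RTX 4090 config, RTX 3090 config).
# The dictionary layout and function name are illustrative assumptions.
decode = {
    "chat-short": (154.4, 136.2),
    "chat-long": (146.3, 127.5),
    "concurrent-decode": (154.2, 133.9),
    "agent-trace": (149.4, 127.5),
}


def speedup(workload: str) -> float:
    """Return RTX 4090 decode throughput divided by RTX 3090 decode throughput."""
    rtx4090, rtx3090 = decode[workload]
    return rtx4090 / rtx3090


if __name__ == "__main__":
    for name in decode:
        print(f"{name}: {speedup(name):.2f}x")
```

The ratio covers decode throughput only. Prefill and TTFT are left out because the concurrent-decode rows report no data for them, and the chat-short row on the RTX 3090 configuration differs sharply from the other rows in its table.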

llama3.1 on hardware