llm-speed

gemma3

12 workload results across 1 hardware configuration.

Fastest local config

195.0 decode tok/s

on RTX 4090 (24GB) + AMD EPYC 7443 24-Core Processor (24c) + 252GB via ollama (Q4_K_M), chat-short workload, run r_dlanfbgym0h

Local runs (12 runs)

Runs from contributors' own machines via MLX, llama.cpp, vLLM, exllamav2, or ollama. Signed on the submitter's hardware.

RTX 4090 (24GB) + AMD EPYC 7443 24-Core Processor (24c) + 252GB

| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | ollama@0.31.1 | Q4_K_M | 46.95 | 144.5 | 817 | r_x23y_sg24pm |
| chat-long | ollama@0.31.1 | Q4_K_M | 45.25 | 1,594.1 | 1,998 | r_x23y_sg24pm |
| concurrent-decode | ollama@0.31.1 | Q4_K_M | 46.17 | no data | no data | r_x23y_sg24pm |
| agent-trace | ollama@0.31.1 | Q4_K_M | 45.33 | 2,556.8 | 956 | r_x23y_sg24pm |
| chat-short | ollama@0.31.1 | Q4_K_M | 92.65 | 171.4 | 689 | r_3kmrc135e0e |
| chat-long | ollama@0.31.1 | Q4_K_M | 87.60 | 2,206.4 | 1,444 | r_3kmrc135e0e |
| concurrent-decode | ollama@0.31.1 | Q4_K_M | 90.43 | no data | no data | r_3kmrc135e0e |
| agent-trace | ollama@0.31.1 | Q4_K_M | 88.53 | 3,123.9 | 815 | r_3kmrc135e0e |
| chat-short | ollama@0.31.1 | Q4_K_M | 195.0 | 167.2 | 706 | r_dlanfbgym0h |
| chat-long | ollama@0.31.1 | Q4_K_M | 187.6 | 3,254.1 | 979 | r_dlanfbgym0h |
| concurrent-decode | ollama@0.31.1 | Q4_K_M | 193.1 | no data | no data | r_dlanfbgym0h |
| agent-trace | ollama@0.31.1 | Q4_K_M | 190.0 | 3,543.4 | 715 | r_dlanfbgym0h |
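
As a rough illustration of how the "Fastest local config" headline relates to the rows above, the sketch below picks the row with the highest decode rate and averages decode tok/s per run. The `ROWS` list and the helper names `fastest` and `mean_decode_by_run` are hypothetical and only transcribe the listed values; this is not the site's own aggregation code.

```python
from collections import defaultdict
from statistics import mean

# (workload, decode tok/s, run id), transcribed from the results listed above.
ROWS = [
    ("chat-short", 46.95, "r_x23y_sg24pm"),
    ("chat-long", 45.25, "r_x23y_sg24pm"),
    ("concurrent-decode", 46.17, "r_x23y_sg24pm"),
    ("agent-trace", 45.33, "r_x23y_sg24pm"),
    ("chat-short", 92.65, "r_3kmrc135e0e"),
    ("chat-long", 87.60, "r_3kmrc135e0e"),
    ("concurrent-decode", 90.43, "r_3kmrc135e0e"),
    ("agent-trace", 88.53, "r_3kmrc135e0e"),
    ("chat-short", 195.0, "r_dlanfbgym0h"),
    ("chat-long", 187.6, "r_dlanfbgym0h"),
    ("concurrent-decode", 193.1, "r_dlanfbgym0h"),
    ("agent-trace", 190.0, "r_dlanfbgym0h"),
]


def fastest(rows):
    """Return the row with the highest decode rate."""
    return max(rows, key=lambda row: row[1])


def mean_decode_by_run(rows):
    """Average decode tok/s across the workloads of each run."""
    by_run = defaultdict(list)
    for _workload, decode, run in rows:
        by_run[run].append(decode)
    return {run: mean(values) for run, values in by_run.items()}


if __name__ == "__main__":
    print(fastest(ROWS))
    for run, avg in mean_decode_by_run(ROWS).items():
        print(f"{run}: {avg:.2f} tok/s")
```

Under these assumptions, the highest-decode row is the chat-short workload of run r_dlanfbgym0h, which matches the headline figure.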

gemma3 on hardware