Methodology
One reproducible suite, every backend: workloads, hardware fingerprinting,
anti-gaming, and how to dispute a result.
How llm-speed measures inference performance, what we capture per run, what
we deliberately don't, and how the results stay comparable across machines.
Goal
A single command — llm-speed bench — that produces reproducible, comparable
LLM inference numbers on any hardware, against any backend, for any model.
Same workload protocol everywhere. Same captured signals. Same canonical
schema. Numbers from a Mac Studio, an RTX 4090 rig, and a hosted-API endpoint
sit on the same table without footnotes.
Workload suite
Every benchmark run executes a fixed sequence of workloads. The suite is
versioned (suite-v1); a run's results are pinned to the exact suite version
that produced them. Suite changes never invalidate older data.
W1 — chat-short
A short prompt → short completion. Baseline number for any model.
- Prompt: ~128 tokens (a fixed natural-language question)
- Output: 256 tokens
- Batch: 1
- Reports: TTFT, prefill tok/s, decode tok/s, p50 / p95 decode latency
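A minimal sketch of how these reported figures can be derived from per-token timestamps. The function name, arguments, and the fallback from backend-reported prefill time to the TTFT window are illustrative assumptions, not the actual harness API:

```python
import statistics


def chat_short_metrics(request_start: float, token_times: list[float],
                       prompt_tokens: int = 128,
                       prefill_s: float | None = None) -> dict:
    """Derive the W1 figures from absolute per-token arrival timestamps (seconds).

    request_start: when the request was issued.
    prefill_s: backend-reported prefill duration, if the backend exposes one.
    """
    first_token, last_token = token_times[0], token_times[-1]
    ttft_s = first_token - request_start

    # Prefer the backend-reported prefill duration; otherwise fall back to the
    # TTFT window, which is a conservative lower bound on prefill throughput.
    prefill_tps = prompt_tokens / (prefill_s if prefill_s else ttft_s)

    # Decode throughput: remaining output tokens over the decode window.
    decode_tps = (len(token_times) - 1) / (last_token - first_token)

    # Per-token decode latencies: gaps between consecutive output tokens.
    gaps_ms = [(b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])]
    cuts = statistics.quantiles(gaps_ms, n=100)  # 99 percentile cut points

    return {
        "ttft_ms": ttft_s * 1000.0,
        "prefill_tps": prefill_tps,
        "decode_tps": decode_tps,
        "decode_p50_latency_ms": cuts[49],
        "decode_p95_latency_ms": cuts[94],
    }
```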
W2 — chat-long
A 4 k-token prompt → 1 k-token completion. Surfaces prefill scaling.
- Prompt: 4 096 tokens (fixture paragraphs concatenated to length)
- Output: 1 024 tokens
- Batch: 1
- Reports: TTFT, prefill tok/s, decode tok/s, latency percentiles
W3 — long-context-decay
The same model evaluated at increasing input context (32 k, 64 k, 128 k tokens where supported), each yielding a 256-token completion. Captures how prefill and decode degrade with context length.
- Reports: an array of
{ context_tokens, prefill_tps, decode_tps, ttft_ms }
This workload is opt-in (skipped from the default suite) because it's slow
and hits memory limits on smaller rigs. Use --workload long-context-decay.
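Conceptually, the workload is a sweep over context sizes. A minimal sketch, in which the `run_completion` callable and the out-of-memory handling are hypothetical stand-ins for the per-backend driver:

```python
def long_context_decay(run_completion,
                       context_sizes=(32_768, 65_536, 131_072),
                       output_tokens=256) -> list[dict]:
    """Run the opt-in W3 sweep and collect one record per context size."""
    results = []
    for context_tokens in context_sizes:
        try:
            # run_completion is assumed to return the same per-workload
            # signals as the other workloads (hypothetical interface).
            r = run_completion(prompt_tokens=context_tokens,
                               output_tokens=output_tokens)
        except MemoryError:
            # Smaller rigs hit memory limits at the larger contexts;
            # record what completed and stop the sweep.
            break
        results.append({
            "context_tokens": context_tokens,
            "prefill_tps": r["prefill_tps"],
            "decode_tps": r["decode_tps"],
            "ttft_ms": r["ttft_ms"],
        })
    return results
```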
W4 — concurrent-decode
Multiple simultaneous decode streams at batch sizes 1, 4, 8, 16. Shows aggregate throughput vs per-stream latency tradeoff.
- Each stream: 1 024-token prompt → 256-token completion
- Reports: per-batch
{ batch, aggregate_tps, per_stream_tps, p50_ms, p95_ms }
Backends without true concurrent decoding (e.g. MLX runs serially in-process) are flagged on the result so the number isn't misread as parallel throughput.
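A minimal sketch of the batch sweep and the aggregate-vs-per-stream arithmetic. The async `run_stream` callable is hypothetical, and reading the latency percentiles as per-token decode gaps pooled across streams is one possible interpretation of the reported fields:

```python
import asyncio
import statistics
import time


async def concurrent_decode(run_stream, batch_sizes=(1, 4, 8, 16)) -> list[dict]:
    """Launch `batch` identical streams at once and compare aggregate
    throughput with per-stream throughput.

    run_stream: hypothetical async callable that runs one 1 024-token-prompt /
    256-token-completion stream and returns (output_tokens, per_token_gaps_ms).
    """
    results = []
    for batch in batch_sizes:
        start = time.perf_counter()
        streams = await asyncio.gather(*(run_stream() for _ in range(batch)))
        wall_s = time.perf_counter() - start

        total_tokens = sum(tokens for tokens, _ in streams)
        gaps_ms = [g for _, gaps in streams for g in gaps]
        cuts = statistics.quantiles(gaps_ms, n=100)

        results.append({
            "batch": batch,
            "aggregate_tps": total_tokens / wall_s,
            "per_stream_tps": total_tokens / wall_s / batch,
            "p50_ms": cuts[49],
            "p95_ms": cuts[94],
        })
    return results
```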
W5 — agent-trace
A canned multi-turn coding-agent trace with realistic tool calls (read-file, write-file, run-tests, fix, repeat). The trace itself is fixed bytes — every machine sees identical context at every turn — so timings are comparable. Context grows to ~16 k tokens by the final turn.
- Reports: per-turn TTFT and decode tok/s, end-to-end wall time, prefix-cache hit rate where the backend exposes it.
This workload exists because chat completion alone misses the workload that matters most for daily-driver LLM use today: agents that drive the model through 10+ tool-call turns over a growing context.
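A minimal sketch of how the replay keeps context identical on every machine: both the per-turn input and the reply fed into the next turn come from the canned trace, so the model's own output never changes what gets prefilled. The `run_turn` callable and trace field names are hypothetical:

```python
import time


def agent_trace(run_turn, trace_turns: list[dict]) -> dict:
    """Replay the canned multi-turn trace and record per-turn signals."""
    per_turn = []
    context = ""
    start = time.perf_counter()
    for turn in trace_turns:
        context += turn["input"]           # fixed bytes, identical everywhere
        r = run_turn(context)              # returns the usual per-run signals
        context += turn["canned_output"]   # the reply is canned too, so the
                                           # next turn's context stays fixed
        per_turn.append({"ttft_ms": r["ttft_ms"], "decode_tps": r["decode_tps"]})
    return {
        "turns": per_turn,
        "wall_ms": (time.perf_counter() - start) * 1000.0,
    }
```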
Hardware fingerprint
Captured automatically on every run. Always coarse — class, not device.
| Field | Example |
|---|---|
| OS | Darwin 26 (major version only — no patch) |
| CPU | Apple M3 Pro, 12 cores |
| RAM | 40 GB (rounded to nearest 8 GB bucket) |
| GPU | M3 Pro, 18-core GPU (no driver build, no PCI bus ID) |
| Backend versions | mlx 0.31.3, llama.cpp 7410, cuda 13.0 |
| Accelerator summary | "M3 Pro (18-core GPU) + 36 GB unified" |
The fingerprint hash is a SHA-256 over only the bucketed fields above. Two machines with identical hardware produce identical hashes — a class identifier, not a device identifier.
In --strict-anon mode, the hash is omitted entirely.
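A minimal sketch of the bucketing and hashing, assuming a canonical-JSON serialization and the field names shown in the table above (the exact field set and canonicalisation are illustrative):

```python
import hashlib
import json


def bucket_ram_gb(ram_gb: float) -> int:
    """Round reported RAM to the nearest 8 GB bucket (e.g. 36 GB -> 40 GB)."""
    return int(round(ram_gb / 8) * 8)


def fingerprint_hash(os_major: str, cpu: str, ram_gb: float, gpu: str,
                     backend_versions: dict, accel_summary: str) -> str:
    """SHA-256 over a canonical JSON of the coarse, bucketed fields only.

    Nothing device-unique (serials, bus IDs, driver builds) is included,
    so identical hardware classes hash identically.
    """
    coarse = {
        "os": os_major,
        "cpu": cpu,
        "ram_gb": bucket_ram_gb(ram_gb),
        "gpu": gpu,
        "backend_versions": backend_versions,
        "accelerator_summary": accel_summary,
    }
    canonical = json.dumps(coarse, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```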
Captured signals per workload result
{
"workload": "chat-short",
"ttft_ms": 142.3,
"prefill_tps": 8421.1,
"decode_tps": 187.4,
"decode_p50_latency_ms": 5.3,
"decode_p95_latency_ms": 7.1,
"prompt_tokens": 128,
"output_tokens": 256,
"wall_ms": 1530,
"prefix_cache_hit_rate": null,
"backend": "llama.cpp",
"backend_version": "7410",
"model": {
"identifier": "Qwen/Qwen2.5-7B-Instruct-GGUF",
"name": "Qwen2.5-7B-Instruct",
"size": "7B",
"quant": "Q4_K_M",
"digest": "<sha256 prefix of weights file>"
}
}
The model digest pins the exact weights — so two "Llama 3.3 70B Q4_K_M" results from different sources don't get blended if one used a different quant pipeline.
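A minimal sketch of the digest, assuming a plain SHA-256 over the weights file truncated to a short prefix (the prefix length here is illustrative):

```python
import hashlib


def model_digest(weights_path: str, prefix_len: int = 16) -> str:
    """SHA-256 of the weights file, truncated to a short prefix.

    Two results only share a digest if they ran the exact same weight bytes.
    """
    h = hashlib.sha256()
    with open(weights_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()[:prefix_len]
```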
Per-token decode timings (raw_timings_ms) are kept locally for replay
but not uploaded by default. They're available on dispute request.
Supported backends
llm-speed autodetects each backend via a thin driver module. Drivers are
isolated: a missing dependency means the driver is skipped, not crashed.
| Backend | Detection | Status |
|---|---|---|
| llama.cpp | llama-server / llama-cli on PATH | First-class |
| ollama | daemon reachable on localhost:11434 | First-class |
| mlx | Apple Silicon + mlx-lm installed | First-class |
| vllm | running HTTP server, or vllm Python package | First-class |
| exllamav2 | CUDA + exllamav2 Python package | First-class |
| hosted-api | provider API key in env (OPENAI_API_KEY, OPENROUTER_API_KEY, TOGETHER_API_KEY, FIREWORKS_API_KEY, GROQ_API_KEY, ANTHROPIC_API_KEY) | First-class (OpenAI-compatible + Anthropic adapter) |
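A minimal sketch of the detection pass described above. Each check is cheap and failure-isolated, so a missing binary, unreachable daemon, or absent package just marks that driver unavailable; the vllm running-server case is omitted:

```python
import importlib.util
import os
import platform
import shutil
import urllib.request


def detect_backends() -> dict[str, bool]:
    """Illustrative autodetection of the supported backends."""
    found = {}

    # llama.cpp: server or CLI binary on PATH.
    found["llama.cpp"] = bool(shutil.which("llama-server") or shutil.which("llama-cli"))

    # ollama: daemon answering on localhost:11434.
    try:
        urllib.request.urlopen("http://localhost:11434", timeout=1)
        found["ollama"] = True
    except Exception:
        found["ollama"] = False

    # mlx: Apple Silicon plus the mlx-lm package.
    found["mlx"] = (platform.system() == "Darwin" and platform.machine() == "arm64"
                    and importlib.util.find_spec("mlx_lm") is not None)

    # vllm / exllamav2: Python packages importable.
    found["vllm"] = importlib.util.find_spec("vllm") is not None
    found["exllamav2"] = importlib.util.find_spec("exllamav2") is not None

    # hosted-api: any known provider key in the environment.
    keys = ("OPENAI_API_KEY", "OPENROUTER_API_KEY", "TOGETHER_API_KEY",
            "FIREWORKS_API_KEY", "GROQ_API_KEY", "ANTHROPIC_API_KEY")
    found["hosted-api"] = any(os.environ.get(k) for k in keys)

    return found
```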
Reproducibility & data trust
Every submission is signed. The CLI signs each upload with an Ed25519 keypair as a JWS (RFC 7515) compact token. The public key rides in the protected header, so anyone can verify. The server rejects any submission whose signature doesn't match its payload.
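A minimal sketch of the signing and verification flow using PyJWT and the cryptography package; the actual CLI may use a different JOSE library or header layout, and the `jwk` header placement here is an illustrative assumption:

```python
import base64

import jwt  # pip install "pyjwt[crypto]"
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def sign_submission(payload: dict, private_key: Ed25519PrivateKey) -> str:
    """Sign a submission as a JWS compact token (EdDSA / Ed25519),
    carrying the public key in the protected header so anyone can verify."""
    raw_pub = private_key.public_key().public_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PublicFormat.Raw,
    )
    jwk = {
        "kty": "OKP",
        "crv": "Ed25519",
        "x": base64.urlsafe_b64encode(raw_pub).rstrip(b"=").decode(),
    }
    return jwt.encode(payload, private_key, algorithm="EdDSA", headers={"jwk": jwk})


def verify_submission(token: str) -> dict:
    """Verify a submission using only the public key embedded in its header."""
    header = jwt.get_unverified_header(token)
    x = header["jwk"]["x"]
    raw_pub = base64.urlsafe_b64decode(x + "=" * (-len(x) % 4))
    public_key = Ed25519PublicKey.from_public_bytes(raw_pub)
    return jwt.decode(token, key=public_key, algorithms=["EdDSA"])
```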
Backend version + build flags are captured. A result is pinned not just to (model × hardware × backend) but to the exact backend build that ran it. Numbers don't drift silently when llama.cpp ships a new release.
Outliers are flagged, not deleted. Server-side outlier detection marks runs that fall more than 3σ from their (model × hardware × backend) cluster for community review. The dispute thread is public.
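A minimal sketch of the 3σ check, using decode throughput as the clustered metric purely for illustration:

```python
import statistics


def flag_outliers(cluster_decode_tps: list[float],
                  threshold_sigma: float = 3.0) -> list[bool]:
    """Mark runs that fall more than 3 sigma from their
    (model x hardware x backend) cluster mean. Flagged, never deleted."""
    mean = statistics.fmean(cluster_decode_tps)
    sigma = statistics.pstdev(cluster_decode_tps)
    if sigma == 0:
        return [False] * len(cluster_decode_tps)
    return [abs(x - mean) > threshold_sigma * sigma for x in cluster_decode_tps]
```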
Suite versioning preserves history. When the workload suite gets new
prompts or new tests, it gets a new version (suite-v2). Old suite-v1
results stay valid — they're filterable, not invalidated.
The harness is open source. The exact bytes the CLI sends are
inspectable via llm-speed bench --dry-run --print-payload. The repository
on GitHub contains the full client, the API, and this methodology.
What we deliberately don't measure
- Quality regression. Whether a model produces coherent output is orthogonal to how fast it produces it. There are good benchmarks for quality (LMSYS Arena, GDPval, MMLU-Pro). We don't replicate them.
- Power efficiency (tok/s/W). Interesting, but it adds wall-power instrumentation complexity. On the roadmap.
- Distributed multi-node throughput. We measure single-node first.
Versioning
The current methodology is suite-v1. The suite_version field on every
result identifies which protocol produced it. Future versions will be
backwards-compatible at the schema level — a run from an older suite stays
accessible and comparable within its version.