Methodology
One reproducible suite, every backend: workloads, hardware fingerprinting,
anti-gaming, and how to dispute a result.
How llm-speed measures inference performance, what we capture per run, what
we deliberately don't, and how the results stay comparable across machines.
Goal
A single command — llm-speed bench — that produces reproducible, comparable
LLM inference numbers on any hardware, against any backend, for any model.
Same workload protocol everywhere. Same captured signals. Same canonical
schema. Numbers from a Mac Studio, an RTX 4090 rig, and a hosted-API endpoint
sit on the same table without footnotes.
Workload suite
Every benchmark run executes a fixed sequence of workloads. The suite is
versioned (suite-v1); a run's results are pinned to the exact suite version
that produced them. Suite changes never invalidate older data.
W1 — chat-short
A short prompt → short completion. Baseline number for any model.
- Prompt: ~128 tokens (a fixed natural-language question)
- Output: 256 tokens
- Batch: 1
- Reports: TTFT, prefill tok/s, decode tok/s, p50 / p95 decode latency
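A minimal sketch of how these reported figures can be derived from per-token timestamps. The function name, arguments, and the fallback from backend-reported prefill time to the TTFT window are illustrative assumptions, not the actual harness API:

```python
import statistics


def chat_short_metrics(request_start: float, token_times: list[float],
                       prompt_tokens: int = 128,
                       prefill_s: float | None = None) -> dict:
    """Derive the W1 figures from absolute per-token arrival timestamps (seconds).

    request_start: when the request was issued.
    prefill_s: backend-reported prefill duration, if the backend exposes one.
    """
    first_token, last_token = token_times[0], token_times[-1]
    ttft_s = first_token - request_start

    # Prefer the backend-reported prefill duration; otherwise fall back to the
    # TTFT window, which is a conservative lower bound on prefill throughput.
    prefill_tps = prompt_tokens / (prefill_s if prefill_s else ttft_s)

    # Decode throughput: remaining output tokens over the decode window.
    decode_tps = (len(token_times) - 1) / (last_token - first_token)

    # Per-token decode latencies: gaps between consecutive output tokens.
    gaps_ms = [(b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])]
    cuts = statistics.quantiles(gaps_ms, n=100)  # 99 percentile cut points

    return {
        "ttft_ms": ttft_s * 1000.0,
        "prefill_tps": prefill_tps,
        "decode_tps": decode_tps,
        "decode_p50_latency_ms": cuts[49],
        "decode_p95_latency_ms": cuts[94],
    }
```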
W2 — chat-long
A 4 k-token prompt → 1 k-token completion. Surfaces prefill scaling.
- Prompt: 4 096 tokens (fixture paragraphs concatenated to length)
- Output: 1 024 tokens
- Batch: 1
- Reports: TTFT, prefill tok/s, decode tok/s, latency percentiles
W3 — long-context-decay
The same model evaluated at increasing input context (32 k, 64 k, 128 k tokens where supported), each yielding a 256-token completion. Captures how prefill and decode degrade with context length.
- Reports: an array of
{ context_tokens, prefill_tps, decode_tps, ttft_ms }
This workload is opt-in (skipped from the default suite) because it's slow
and hits memory limits on smaller rigs. Use --workload long-context-decay.
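Conceptually, the workload is a sweep over context sizes. A minimal sketch, in which the `run_completion` callable and the out-of-memory handling are hypothetical stand-ins for the per-backend driver:

```python
def long_context_decay(run_completion,
                       context_sizes=(32_768, 65_536, 131_072),
                       output_tokens=256) -> list[dict]:
    """Run the opt-in W3 sweep and collect one record per context size."""
    results = []
    for context_tokens in context_sizes:
        try:
            # run_completion is assumed to return the same per-workload
            # signals as the other workloads (hypothetical interface).
            r = run_completion(prompt_tokens=context_tokens,
                               output_tokens=output_tokens)
        except MemoryError:
            # Smaller rigs hit memory limits at the larger contexts;
            # record what completed and stop the sweep.
            break
        results.append({
            "context_tokens": context_tokens,
            "prefill_tps": r["prefill_tps"],
            "decode_tps": r["decode_tps"],
            "ttft_ms": r["ttft_ms"],
        })
    return results
```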
W4 — concurrent-decode
Multiple simultaneous decode streams at batch sizes 1, 4, 8, 16. Shows aggregate throughput vs per-stream latency tradeoff.
- Each stream: 1 024-token prompt → 256-token completion
- Reports: per-batch
{ batch, aggregate_tps, per_stream_tps, p50_ms, p95_ms }
Backends without true concurrent decoding (e.g. MLX runs serially in-process) are flagged on the result so the number isn't misread as parallel throughput.
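A minimal sketch of the batch sweep and the aggregate-vs-per-stream arithmetic. The async `run_stream` callable is hypothetical, and reading the latency percentiles as per-token decode gaps pooled across streams is one possible interpretation of the reported fields:

```python
import asyncio
import statistics
import time


async def concurrent_decode(run_stream, batch_sizes=(1, 4, 8, 16)) -> list[dict]:
    """Launch `batch` identical streams at once and compare aggregate
    throughput with per-stream throughput.

    run_stream: hypothetical async callable that runs one 1 024-token-prompt /
    256-token-completion stream and returns (output_tokens, per_token_gaps_ms).
    """
    results = []
    for batch in batch_sizes:
        start = time.perf_counter()
        streams = await asyncio.gather(*(run_stream() for _ in range(batch)))
        wall_s = time.perf_counter() - start

        total_tokens = sum(tokens for tokens, _ in streams)
        gaps_ms = [g for _, gaps in streams for g in gaps]
        cuts = statistics.quantiles(gaps_ms, n=100)

        results.append({
            "batch": batch,
            "aggregate_tps": total_tokens / wall_s,
            "per_stream_tps": total_tokens / wall_s / batch,
            "p50_ms": cuts[49],
            "p95_ms": cuts[94],
        })
    return results
```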
W5 — agent-trace
A canned multi-turn coding-agent trace with realistic tool calls (read-file, write-file, run-tests, fix, repeat). The trace itself is fixed bytes — every machine sees identical context at every turn — so timings are comparable. Context grows to ~16 k tokens by the final turn.
- Reports: per-turn TTFT and decode tok/s, end-to-end wall time, prefix-cache hit rate where the backend exposes it.
This workload exists because chat completion alone misses the workload that matters most for daily-driver LLM use today: agents that drive the model through 10+ tool-call turns over a growing context.
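A minimal sketch of how the replay keeps context identical on every machine: both the per-turn input and the reply fed into the next turn come from the canned trace, so the model's own output never changes what gets prefilled. The `run_turn` callable and trace field names are hypothetical:

```python
import time


def agent_trace(run_turn, trace_turns: list[dict]) -> dict:
    """Replay the canned multi-turn trace and record per-turn signals."""
    per_turn = []
    context = ""
    start = time.perf_counter()
    for turn in trace_turns:
        context += turn["input"]           # fixed bytes, identical everywhere
        r = run_turn(context)              # returns the usual per-run signals
        context += turn["canned_output"]   # the reply is canned too, so the
                                           # next turn's context stays fixed
        per_turn.append({"ttft_ms": r["ttft_ms"], "decode_tps": r["decode_tps"]})
    return {
        "turns": per_turn,
        "wall_ms": (time.perf_counter() - start) * 1000.0,
    }
```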
Hardware fingerprint
Captured automatically on every run. Always coarse — class, not device.
| Field | Example |
|---|---|
| OS | Darwin 26 (major version only — no patch) |
| CPU | Apple M3 Pro, 12 cores |
| RAM | 40 GB (rounded to nearest 8 GB bucket) |
| GPU | M3 Pro, 18-core GPU (no driver build, no PCI bus ID) |
| Backend versions | mlx 0.31.3, llama.cpp 7410, cuda 13.0 |
| Accelerator summary | "M3 Pro (18-core GPU) + 36 GB unified" |
The fingerprint hash is a SHA-256 over only the bucketed fields above. Two machines with identical hardware produce identical hashes — a class identifier, not a device identifier.
In --strict-anon mode, the hash is omitted entirely.
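A minimal sketch of the bucketing and hashing, assuming a canonical-JSON serialization and the field names shown in the table above (the exact field set and canonicalisation are illustrative):

```python
import hashlib
import json


def bucket_ram_gb(ram_gb: float) -> int:
    """Round reported RAM to the nearest 8 GB bucket (e.g. 36 GB -> 40 GB)."""
    return int(round(ram_gb / 8) * 8)


def fingerprint_hash(os_major: str, cpu: str, ram_gb: float, gpu: str,
                     backend_versions: dict, accel_summary: str) -> str:
    """SHA-256 over a canonical JSON of the coarse, bucketed fields only.

    Nothing device-unique (serials, bus IDs, driver builds) is included,
    so identical hardware classes hash identically.
    """
    coarse = {
        "os": os_major,
        "cpu": cpu,
        "ram_gb": bucket_ram_gb(ram_gb),
        "gpu": gpu,
        "backend_versions": backend_versions,
        "accelerator_summary": accel_summary,
    }
    canonical = json.dumps(coarse, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```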
Captured signals per workload result
{
"workload": "chat-short",
"ttft_ms": 142.3,
"prefill_tps": 8421.1,
"decode_tps": 187.4,
"decode_p50_latency_ms": 5.3,
"decode_p95_latency_ms": 7.1,
"prompt_tokens": 128,
"output_tokens": 256,
"wall_ms": 1530,
"prefix_cache_hit_rate": null,
"backend": "llama.cpp",
"backend_version": "7410",
"model": {
"identifier": "Qwen/Qwen2.5-7B-Instruct-GGUF",
"name": "Qwen2.5-7B-Instruct",
"size": "7B",
"quant": "Q4_K_M",
"digest": "<sha256 prefix of weights file>"
}
}
The model digest pins the exact weights — so two "Llama 3.3 70B Q4_K_M" results from different sources don't get blended if one used a different quant pipeline.
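A minimal sketch of the digest, assuming a plain SHA-256 over the weights file truncated to a short prefix (the prefix length here is illustrative):

```python
import hashlib


def model_digest(weights_path: str, prefix_len: int = 16) -> str:
    """SHA-256 of the weights file, truncated to a short prefix.

    Two results only share a digest if they ran the exact same weight bytes.
    """
    h = hashlib.sha256()
    with open(weights_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()[:prefix_len]
```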
Per-token decode timings (raw_timings_ms) are kept locally for replay
but not uploaded by default. They're available on dispute request.
Supported backends
llm-speed autodetects each backend via a thin driver module. Drivers are
isolated: a missing dependency means the driver is skipped, not crashed.
| Backend | Detection | Status |
|---|---|---|
| llama.cpp | llama-server / llama-cli on PATH | First-class |
| ollama | daemon reachable on localhost:11434 | First-class |
| mlx | Apple Silicon + mlx-lm installed | First-class |
| vllm | running HTTP server, or vllm Python package | First-class |
| exllamav2 | CUDA + exllamav2 Python package | First-class |
| hosted-api | provider API key in env (OPENAI_API_KEY, OPENROUTER_API_KEY, TOGETHER_API_KEY, FIREWORKS_API_KEY, GROQ_API_KEY, ANTHROPIC_API_KEY) | First-class (OpenAI-compatible + Anthropic adapter) |
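A minimal sketch of the detection pass described above. Each check is cheap and failure-isolated, so a missing binary, unreachable daemon, or absent package just marks that driver unavailable; the vllm running-server case is omitted:

```python
import importlib.util
import os
import platform
import shutil
import urllib.request


def detect_backends() -> dict[str, bool]:
    """Illustrative autodetection of the supported backends."""
    found = {}

    # llama.cpp: server or CLI binary on PATH.
    found["llama.cpp"] = bool(shutil.which("llama-server") or shutil.which("llama-cli"))

    # ollama: daemon answering on localhost:11434.
    try:
        urllib.request.urlopen("http://localhost:11434", timeout=1)
        found["ollama"] = True
    except Exception:
        found["ollama"] = False

    # mlx: Apple Silicon plus the mlx-lm package.
    found["mlx"] = (platform.system() == "Darwin" and platform.machine() == "arm64"
                    and importlib.util.find_spec("mlx_lm") is not None)

    # vllm / exllamav2: Python packages importable.
    found["vllm"] = importlib.util.find_spec("vllm") is not None
    found["exllamav2"] = importlib.util.find_spec("exllamav2") is not None

    # hosted-api: any known provider key in the environment.
    keys = ("OPENAI_API_KEY", "OPENROUTER_API_KEY", "TOGETHER_API_KEY",
            "FIREWORKS_API_KEY", "GROQ_API_KEY", "ANTHROPIC_API_KEY")
    found["hosted-api"] = any(os.environ.get(k) for k in keys)

    return found
```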
Reproducibility & data trust
Every submission is signed. The CLI signs each upload with an Ed25519 keypair as a JWS (RFC 7515) compact token. The public key rides in the protected header, so anyone can verify. The server rejects any submission whose signature doesn't match its payload.
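A minimal sketch of the signing and verification flow using PyJWT and the cryptography package; the actual CLI may use a different JOSE library or header layout, and the `jwk` header placement here is an illustrative assumption:

```python
import base64

import jwt  # pip install "pyjwt[crypto]"
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def sign_submission(payload: dict, private_key: Ed25519PrivateKey) -> str:
    """Sign a submission as a JWS compact token (EdDSA / Ed25519),
    carrying the public key in the protected header so anyone can verify."""
    raw_pub = private_key.public_key().public_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PublicFormat.Raw,
    )
    jwk = {
        "kty": "OKP",
        "crv": "Ed25519",
        "x": base64.urlsafe_b64encode(raw_pub).rstrip(b"=").decode(),
    }
    return jwt.encode(payload, private_key, algorithm="EdDSA", headers={"jwk": jwk})


def verify_submission(token: str) -> dict:
    """Verify a submission using only the public key embedded in its header."""
    header = jwt.get_unverified_header(token)
    x = header["jwk"]["x"]
    raw_pub = base64.urlsafe_b64decode(x + "=" * (-len(x) % 4))
    public_key = Ed25519PublicKey.from_public_bytes(raw_pub)
    return jwt.decode(token, key=public_key, algorithms=["EdDSA"])
```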
Backend version + build flags are captured. A result is pinned not just to (model × hardware × backend) but to the exact backend build that ran it. Numbers don't drift silently when llama.cpp ships a new release.
Outliers are flagged, not deleted. Server-side outlier detection marks runs that fall more than 3σ from their (model × hardware × backend) cluster for community review. The dispute thread is public.
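A minimal sketch of the 3σ check, using decode throughput as the clustered metric purely for illustration:

```python
import statistics


def flag_outliers(cluster_decode_tps: list[float],
                  threshold_sigma: float = 3.0) -> list[bool]:
    """Mark runs that fall more than 3 sigma from their
    (model x hardware x backend) cluster mean. Flagged, never deleted."""
    mean = statistics.fmean(cluster_decode_tps)
    sigma = statistics.pstdev(cluster_decode_tps)
    if sigma == 0:
        return [False] * len(cluster_decode_tps)
    return [abs(x - mean) > threshold_sigma * sigma for x in cluster_decode_tps]
```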
Suite versioning preserves history. When the workload suite gets new
prompts or new tests, it gets a new version (suite-v2). Old suite-v1
results stay valid — they're filterable, not invalidated.
The harness is open source. The exact bytes the CLI sends are
inspectable via llm-speed bench --dry-run --print-payload. The repository
on GitHub contains the full client, the API, and this methodology.
What we deliberately don't measure
- Quality regression. Whether a model produces coherent output is orthogonal to how fast it produces it. There are good benchmarks for quality (LMSYS Arena, GDPval, MMLU-Pro). We don't replicate them.
- Power efficiency (tok/s/W). Interesting, but it adds wall-power instrumentation complexity. On the roadmap.
- Distributed multi-node throughput. We measure single-node first.
Versioning
The current methodology is suite-v1. The suite_version field on every
result identifies which protocol produced it. Future versions will be
backwards-compatible at the schema level — a run from an older suite stays
accessible and comparable within its version.