llm-speed

Methodology

How we measure LLM speed

One reproducible suite, every backend. Below: workloads, hardware fingerprinting, signed results, and how to dispute a number.

TL;DR

  • What we measure. Decode tok/s — output tokens per wall-clock second after the first token. Prefill tok/s — prompt tokens per wall-clock second before generation starts. TTFT — milliseconds between request leaving the client and the first token arriving back.
  • How we sign. Every upload is an Ed25519 signature over a canonicalised JSON payload, served as a JWS (RFC 7515) compact token. The 32-byte raw public key rides in the protected header so anyone can verify without contacting us.
  • What we don't do. No cherry-picking — every workload result is uploaded. No retry-and-keep-best — each (workload, model, backend) runs once per invocation. No hidden outliers — runs more than 3σ from their cluster are flagged on the page, not deleted.
  • What you can audit. https://llm-speed.com/r/<id> for any signed run, the exact bytes the CLI signs via llm-speed bench --dry-run --print-payload, and this document at docs/METHODOLOGY.md.

1. Workload suite

Every benchmark run executes a fixed sequence of workloads. The suite is versioned (suite-v1); each result is pinned to the exact version that produced it. New suites get a new version; old runs stay valid and filterable.

W1 — chat-short

A short prompt followed by a short completion. Baseline number for any model.

| Field | Value |
| --- | --- |
| Prompt | 128 tokens (a fixed natural-language question, byte-identical on every machine) |
| Output | 256 tokens |
| Batch | 1 |
| Reports | TTFT (ms), prefill tok/s, decode tok/s, p50 / p95 per-token decode latency |

W2 — chat-long

A 4 096-token prompt and a 1 024-token completion. Surfaces prefill scaling.

| Field | Value |
| --- | --- |
| Prompt | 4 096 tokens (fixture paragraphs concatenated to length, byte-identical across machines) |
| Output | 1 024 tokens |
| Batch | 1 |
| Reports | TTFT, prefill tok/s, decode tok/s, p50 / p95 latency |

W3 — long-context-decay

The same model evaluated at 32 k, 64 k, and 128 k input tokens (where the backend supports each tier), each producing a 256-token completion. Captures how prefill and decode degrade with context length.

  • Reports: an array of { context_tokens, prefill_tps, decode_tps, ttft_ms }.
  • Opt-in (skipped by default — slow on small rigs and exceeds memory on short-VRAM hardware). Run with --workload long-context-decay.
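
One way to read the reported array is as throughput retained relative to the smallest tier that ran. The sketch below assumes only the four field names listed above; the function name `decay_profile` and the `*_retained` keys are hypothetical and not part of the CLI.

```python
def decay_profile(tiers):
    """Express each tier's throughput as a fraction of the smallest tier's.

    tiers: list of dicts shaped like the reported array, i.e.
    { context_tokens, prefill_tps, decode_tps, ttft_ms }.
    """
    if not tiers:
        raise ValueError("no tiers to compare")
    ordered = sorted(tiers, key=lambda t: t["context_tokens"])
    base = ordered[0]  # smallest context that actually ran on this backend
    profile = []
    for tier in ordered:
        profile.append({
            "context_tokens": tier["context_tokens"],
            # 1.0 means no slowdown; 0.5 means half the baseline throughput
            "prefill_retained": tier["prefill_tps"] / base["prefill_tps"],
            "decode_retained": tier["decode_tps"] / base["decode_tps"],
        })
    return profile
```

A backend that skips a tier simply contributes fewer entries, so the baseline is whichever tier ran first in context order.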

W4 — concurrent-decode

Multiple simultaneous decode streams at batch sizes 1, 4, 8, 16. Measures the trade-off between aggregate throughput and per-stream latency.

  • Each stream: 1 024-token prompt → 256-token completion.
  • Reports: per-batch { batch, aggregate_tps, per_stream_tps, p50_ms, p95_ms }.
  • Backends without true concurrent decoding (MLX runs serially in-process) are flagged on the result so the number is not misread as parallel throughput.
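
The relationship between the aggregate and per-stream figures can be sketched as follows. This is an illustration, not the CLI's code: it assumes all streams in a batch start together and that the batch ends when the slowest stream finishes. The function name and its inputs are hypothetical.

```python
def summarize_batch(stream_tokens, stream_seconds):
    """Aggregate vs per-stream decode throughput for one batch.

    stream_tokens[i]:  output tokens generated by stream i
    stream_seconds[i]: decode wall time of stream i, in seconds
    """
    if not stream_tokens or len(stream_tokens) != len(stream_seconds):
        raise ValueError("need one (tokens, seconds) pair per stream")
    batch = len(stream_tokens)
    per_stream = [t / s for t, s in zip(stream_tokens, stream_seconds)]
    # Streams overlap in time, so total tokens are divided by the longest
    # stream's wall time rather than by the sum of all wall times.
    aggregate_tps = sum(stream_tokens) / max(stream_seconds)
    return {
        "batch": batch,
        "aggregate_tps": aggregate_tps,
        "per_stream_tps": sum(per_stream) / batch,  # mean across streams
    }
```

On a backend that runs streams serially, the wall times add up instead of overlapping, which is why such results are flagged rather than read as parallel throughput.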

W5 — agent-trace

A canned multi-turn coding-agent trace with realistic tool calls (read-file, write-file, run-tests, fix, repeat). The trace is fixed bytes — every machine sees identical context at every turn — so timings are comparable. Context grows to roughly 16 k tokens by the final turn.

  • Reports: per-turn TTFT and decode tok/s, end-to-end wall time, and prefix-cache hit rate where the backend exposes it.

2. Hardware fingerprint

Captured automatically on every run. Always coarse — class, not device.

| Field | Example |
| --- | --- |
| OS | Darwin 26 (major version only, no patch level) |
| CPU | Apple M3 Pro, 12 cores |
| RAM | 40 GB (rounded to nearest 8 GB bucket) |
| GPU | M3 Pro, 18-core GPU (no driver build, no PCI bus ID) |
| Backend versions | mlx 0.31.3, llama.cpp 7410, cuda 13.0 |
| Accelerator summary | M3 Pro (18-core GPU) + 36 GB unified |

The fingerprint_hash is a SHA-256 over only the bucketed fields above. Two machines with identical hardware produce identical hashes — a class identifier, not a device identifier. In --strict-anon mode the hash is omitted entirely.
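
A minimal sketch of a class-level hash, under stated assumptions: the field names, the JSON serialisation, and the round-half-up tie-break for RAM buckets are all hypothetical. The source specifies only SHA-256 over the bucketed fields.

```python
import hashlib
import json


def ram_bucket_gb(ram_gb, step=8):
    """Round RAM to the nearest multiple of `step` GB (ties round up; assumed)."""
    return int((ram_gb + step / 2) // step) * step


def fingerprint_hash(fields):
    """SHA-256 over a deterministic serialisation of the bucketed fields."""
    # Sorted keys and fixed separators make the bytes independent of dict order.
    blob = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


# Hypothetical field names, filled with the example values from the table above.
example = {
    "os": "Darwin 26",
    "cpu": "Apple M3 Pro, 12 cores",
    "ram_gb": ram_bucket_gb(36),
    "gpu": "M3 Pro, 18-core GPU",
}
digest = fingerprint_hash(example)
```

Because only bucketed values enter the hash, two machines of the same class produce the same digest. Under the round-half-up assumption, a 36 GB machine lands in the 40 GB bucket.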

The exact list of collected and excluded fields is documented in docs/PRIVACY.md.


3. Captured signals per workload result

{
  "workload": "chat-short",
  "ttft_ms": 142.3,
  "prefill_tps": 8421.1,
  "decode_tps": 187.4,
  "decode_p50_latency_ms": 5.3,
  "decode_p95_latency_ms": 7.1,
  "prompt_tokens": 128,
  "output_tokens": 256,
  "wall_ms": 1530,
  "prefix_cache_hit_rate": null,
  "backend": "llama.cpp",
  "backend_version": "7410",
  "model": {
    "identifier": "Qwen/Qwen2.5-7B-Instruct-GGUF",
    "name": "Qwen2.5-7B-Instruct",
    "size": "7B",
    "quant": "Q4_K_M",
    "digest": "<sha256 prefix of the weights file>"
  }
}

model.digest pins the exact weights — two Llama 3.3 70B Q4_K_M results from different sources are not blended unless the digests match.

Per-token decode timings (raw_timings_ms) are kept locally for replay and dispute. They are not uploaded by default.
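
The headline signals can be derived from per-token timings roughly as follows. The layout of raw_timings_ms is not specified here, so this sketch assumes a list of token arrival times in milliseconds since the request left the client. It also counts only tokens after the first when computing decode tok/s. The function name is hypothetical.

```python
import statistics


def summarize_decode(token_arrival_ms):
    """Derive TTFT, decode tok/s, and p50/p95 per-token latency.

    token_arrival_ms: arrival time of each output token, in ms since the
    request left the client (assumed layout).
    """
    if len(token_arrival_ms) < 3:
        raise ValueError("need at least three tokens")
    ttft_ms = token_arrival_ms[0]
    # Gaps between consecutive tokens are the per-token decode latencies.
    gaps = [b - a for a, b in zip(token_arrival_ms, token_arrival_ms[1:])]
    decode_seconds = (token_arrival_ms[-1] - token_arrival_ms[0]) / 1000.0
    # Tokens generated after the first one, per wall-clock second.
    decode_tps = len(gaps) / decode_seconds
    # 99 cut points: index 49 is the median, index 94 the 95th percentile.
    cuts = statistics.quantiles(gaps, n=100, method="inclusive")
    return {
        "ttft_ms": ttft_ms,
        "decode_tps": decode_tps,
        "decode_p50_latency_ms": cuts[49],
        "decode_p95_latency_ms": cuts[94],
    }
```

Keeping the raw gaps locally lets a disputed p95 be recomputed from the same data.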


4. Supported backends

llm-speed autodetects each backend via a thin driver module. A missing dependency means the driver is skipped, not crashed.

| Backend | Detection |
| --- | --- |
| llama.cpp | llama-server or llama-cli on PATH |
| ollama | daemon reachable on localhost:11434 |
| mlx | Apple Silicon + mlx-lm installed |
| vllm | running HTTP server, or vllm Python package |
| exllamav2 | CUDA + exllamav2 Python package |
| hosted-api | provider API key in environment: OPENAI_API_KEY, OPENROUTER_API_KEY, TOGETHER_API_KEY, FIREWORKS_API_KEY, GROQ_API_KEY, ANTHROPIC_API_KEY (OpenAI-compatible HTTP, plus a direct Anthropic adapter) |

TensorRT-LLM and SGLang are not yet supported.


5. Signing and verification

Every submission is signed. The CLI generates an Ed25519 keypair on first run (~/.config/llm-speed/keys/ed25519.key, mode 0600). Each upload is a JWS compact token whose payload is the canonicalised result JSON and whose protected header contains the 32-byte raw public key. The server rejects any submission whose signature does not verify against the embedded key.

The same canonicalisation runs on the client and on the server, so the bytes the CLI hashes are exactly the bytes the server hashes — modulo the signature and public_key fields, which are excluded from the canonical form.
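
As an illustration of how the signed bytes could be assembled, the sketch below builds a JWS signing input with the standard library only. The canonicalisation rule (sorted keys, compact separators) and the header layout (an `EdDSA` algorithm with the key carried as an OKP `jwk`) are assumptions; this document states only that the payload is canonicalised JSON and that the 32-byte raw public key sits in the protected header.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWS compact serialisation."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def canonical_bytes(result: dict) -> bytes:
    """Deterministic JSON bytes, excluding signature and public_key."""
    body = {k: v for k, v in result.items()
            if k not in ("signature", "public_key")}
    text = json.dumps(body, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def jws_signing_input(result: dict, public_key: bytes) -> bytes:
    """Return BASE64URL(header) + '.' + BASE64URL(payload) as ASCII bytes."""
    if len(public_key) != 32:
        raise ValueError("Ed25519 raw public keys are 32 bytes")
    # Assumed header layout; the real header fields may differ.
    header = {"alg": "EdDSA",
              "jwk": {"kty": "OKP", "crv": "Ed25519", "x": b64url(public_key)}}
    protected = b64url(json.dumps(header, sort_keys=True,
                                  separators=(",", ":")).encode("utf-8"))
    return f"{protected}.{b64url(canonical_bytes(result))}".encode("ascii")
```

An Ed25519 library would sign these bytes, and the compact token is the signing input followed by `.` and the base64url-encoded signature. A verifier repeats the same steps and checks the signature against the key in the header.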

Inspect the bytes that would be signed before any upload:

llm-speed bench --quick --dry-run --print-payload

--dry-run runs the workloads but skips upload and local save. --print-payload writes the exact JSON payload to stdout.


6. What we don't do

  • No cherry-picking. Every workload result produced by a single llm-speed bench invocation is uploaded together. The CLI cannot drop individual workloads from a submission; the only choice is "submit all of this run, or none of it."
  • No retry-and-keep-best. Each (workload, model, backend) triple runs exactly once per invocation. To produce multiple measurements, run the CLI multiple times — every invocation is a new signed submission, visible on the leaderboard, with its own permalink.
  • No hidden outliers. Server-side outlier detection flags runs that fall more than 3σ from their (model × hardware × backend) cluster. Flagged runs stay on their permalinks and on the per-model and per-hardware pages, marked for community review. The dispute thread is public.
  • No silent backend drift. backend_version is captured per result, so a number does not change meaning when llama.cpp or vLLM ships a new release — it produces a new row pinned to the new version.
  • No quality scoring. Whether a model produces coherent output is orthogonal to how fast it produces it. Quality benchmarks exist (LMSYS Chatbot Arena, GDPval, MMLU-Pro, HumanEval); we do not replicate them.
  • No power efficiency yet. tok/s/W is not measured. Coming when the per-platform power-meter integration lands.
  • No multi-node throughput. Single-node only.
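
The 3σ rule from the outlier bullet above can be sketched as a check of one new run against the existing runs in its cluster. The server-side details are not given here, so this sketch makes two assumptions: it uses the sample standard deviation, and it compares the new value against a cluster that does not yet include it. The function name is hypothetical.

```python
import statistics


def is_outlier(value, cluster, sigmas=3.0):
    """Flag `value` if it lies more than `sigmas` standard deviations
    from the mean of the existing cluster (same model, hardware, backend)."""
    if len(cluster) < 2:
        return False  # not enough history to estimate a spread
    mean = statistics.fmean(cluster)
    spread = statistics.stdev(cluster)  # sample standard deviation
    if spread == 0:
        return value != mean
    return abs(value - mean) > sigmas * spread
```

Comparing against the cluster without the new value matters for small clusters: a single extreme run inflates the standard deviation enough that it can never exceed 3σ of a set that includes it.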

7. Reproducibility

Anyone reading this page should be able to do three things.

7.1 Verify the published wheel matches the open-source CLI

llm-speed verify

verify prints the SHA-256 of the installed CLI's pinned dependency tree, the git commit it was built from, and the URL of the wheel. Compare against the GitHub release asset to confirm the wheel and the source agree.

7.2 Re-run the same workload against any signed run

Every /r/<id> page lists suite_version, backend, backend_version, model.identifier, model.quant, and model.digest. Reproduce the run with:

llm-speed bench \
  --suite suite-v1 \
  --workload chat-short \
  --backend mlx \
  --model mlx-community/Qwen2.5-7B-Instruct-4bit

On hardware in the same accelerator_summary class, your numbers should land within roughly 5 % of the original on decode_tps and within roughly 10 % on ttft_ms — the larger TTFT band reflects thermal state, background load, and KV-cache warmth, none of which the suite controls.
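
A small helper can check a re-run against those bands. The tolerances come from the paragraph above; the function name and dict layout are hypothetical, with keys borrowed from the per-workload JSON.

```python
def within_band(original, rerun, decode_tol=0.05, ttft_tol=0.10):
    """Compare a re-run to the original using relative difference per signal."""
    def rel_diff(reference, observed):
        return abs(observed - reference) / reference

    return {
        "decode_tps": rel_diff(original["decode_tps"],
                               rerun["decode_tps"]) <= decode_tol,
        "ttft_ms": rel_diff(original["ttft_ms"],
                            rerun["ttft_ms"]) <= ttft_tol,
    }
```

A False entry is a prompt to look at thermal state or background load before opening a dispute, since the bands are approximate.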

7.3 Audit any specific run

Click the /r/<id> permalink. The page shows the full fingerprint, the public key, the signature, the suite version, the backend version, and every captured signal per workload. The exact JSON the server stores is at https://api.llm-speed.com/v1/results/<id>.

To check a signature locally:

llm-speed verify-run https://llm-speed.com/r/<id>

This fetches the JSON, recomputes the canonical form, and verifies the Ed25519 signature against the embedded public key. Exit code is 0 on a valid signature, non-zero otherwise.


8. Signed runs you can audit

The following five permalinks are the runs with the highest decode tok/s on the leaderboard as of 2026-05-01. Each is a real, signed, unedited run; the links resolve to the canonical run page with the full fingerprint, signature, and per-workload signals.

| # | Decode tok/s | Model | Hardware | Backend | Run |
| --- | --- | --- | --- | --- | --- |
| 1 | 432.2 | meta-llama/llama-4-scout | M3 Pro (18-core GPU) + 36 GB unified | hosted-api (OpenRouter) | r_g4ph40ae0e1 |
| 2 | 286.5 | mlx-community/Qwen2.5-0.5B-Instruct-4bit | M3 Pro (18-core GPU) + 36 GB unified | mlx 0.31.3 | r_akcbpx5vcqa |
| 3 | 192.5 | mlx-community/stable-code-instruct-3b-4bit | M3 Ultra (60-core GPU) + 96 GB unified | mlx 0.31.3 | r_y2_5y8oo97d |
| 4 | 175.4 | qwen/qwen3-coder-next | M3 Pro (18-core GPU) + 36 GB unified | hosted-api (OpenRouter) | r_zrnrkjvkvvi |
| 5 | 168.3 | mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit | M3 Ultra (60-core GPU) + 96 GB unified | mlx 0.31.3 | r_l_v1-zq_qaz |

The fastest signed run as of 2026-05-01 is 432.2 decode tok/s for meta-llama/llama-4-scout via OpenRouter on the chat-short workload — r_g4ph40ae0e1. The hosted-API number reflects upstream provider speed (OpenRouter routes Llama-4-Scout to wafer-scale serving), not local hardware. The fastest local-hardware run is 286.5 decode tok/s for Qwen2.5-0.5B-Instruct-4bit on M3 Pro via MLX — r_akcbpx5vcqa.

The full live ranking is at /cheatsheet, recomputed from the public API on every page render.


9. How to dispute a number

Every /r/<id> page has a stable permalink and a dispute link. Open a GitHub issue referencing the run ID and the specific number you are challenging. The submitter is invited to reproduce; if they cannot, the result is withdrawn. The exchange is public.

If a result is found to violate the methodology (different workload bytes, modified backend, edited payload), it is removed from the leaderboard and the run page returns a 404 with the message: "The result was withdrawn during a dispute."


10. Versioning

The current methodology is suite-v1. Every result records its suite_version. When the suite changes — new prompts, new workloads, different output lengths — the version increments. Old results stay valid and filterable; they are not invalidated by a new version, only distinguishable from it.

The CLI ships pinned to one suite version per release; a release does not silently change which workloads ran on which result.


11. Open source

The full client, the API, and this methodology are in the public repository.

File an issue if anything in this document does not match the code.

Frequently asked questions

The questions visitors arrive with — answered factually.

How is decode tok/s measured?

Decode tok/s is the number of generated tokens (excluding the prompt) divided by the wall-clock time between the first generated token and the last generated token. The measurement excludes prefill so it isolates per-token generation speed. Streaming is enabled when the backend supports it; otherwise the wall-clock fallback is used.

What is TTFT (time to first token)?

TTFT is the wall-clock latency between the request leaving the client and the first generated token arriving back. It captures connection overhead plus prefill time on the backend. On the local-inference backends llm-speed measures, TTFT is dominated by prefill on the prompt and scales with prompt length and model size.

What is prefill tok/s?

Prefill tok/s is the rate at which the backend processes input prompt tokens before generation starts. It's computed as prompt_tokens / prefill_wall_seconds. Prefill time grows with input length, so prefill is the bottleneck for long-context workloads (W2, W3, W5) where the prompt is large.

What's the difference between prefill and decode tok/s?

Prefill processes the prompt in parallel before generation starts; decode generates one output token at a time. Decode is sequential and almost always slower per-token than prefill, sometimes by 10x or more. A backend can have great prefill and mediocre decode (or vice versa); reporting both keeps the comparison honest.

How do you compare a model across different hardware?

We measure the same workload (same prompt bytes, same output length, same batch size) on every machine, with the model digest pinned so two 'same' models with different quant pipelines aren't blended. Results group naturally as (model x backend x hardware), which is what /m/[slug] and /hw/[slug] surface.

Why are tok/s results different across runs of the same setup?

Run-to-run variance comes from thermal state, background load, and KV-cache reuse. We surface every run rather than averaging, so you can see the spread. Outliers more than 3 sigma from a (model x backend x hardware) cluster are flagged for review, not deleted.

What does 'signed' mean on a benchmark result?

Every upload is signed with an Ed25519 keypair generated on the submitting machine. The public key rides in the JWS protected header so anyone can verify the bytes. The server rejects any submission whose signature doesn't match its payload. Signing prevents result tampering and ties resubmissions to the same keypair, and so to the machine that holds it.

Can I dispute a result?

Yes. Every run page has a stable permalink. Open a GitHub issue referencing the run ID with what you think is wrong and the submitter is invited to reproduce. If they can't, the result is withdrawn. The whole exchange is public.

Why not just publish a single composite score?

Composite scores hide the answer to 'is this fast for my workload?' A model can be fast at short chat and slow at long context, or great at decode and poor at prefill. We publish each workload's metrics separately so the reader picks the dimension that matches their use case.

How are multi-host benchmarks reproduced?

Each rig runs the same suite-v1 workloads end-to-end on its own machine; we never aggregate decode tok/s across hosts. The accelerator_summary string on every run page shows the full primary-accelerator + CPU + RAM breakdown of the host the numbers came from, so two RTX 5090 submissions on different motherboards stay distinguishable. Cross-host comparisons happen at the (model x hardware) page level, not inside a single run.

What does 'multi-host sweep' mean on the recent runs?

A sweep is one operator running the same model across several distinct hosts (e.g. an M3 Pro, an M3 Ultra, and an RTX 5090 box) under the same suite-v1 build, then submitting each as its own signed run. Every host gets its own /r/<id> page, its own fingerprint, and its own row on the relevant /m and /hw aggregations — there's no merged 'multi-host' result, just N independent ones.