Glossary of llm-speed terms
Every term llm-speed uses when reporting a benchmark, defined in one sentence first and elaborated below. Cross-reference /methodology for workload specs and /privacy for data-handling terms.
- tok/s — tokens per second
- tok/s, short for tokens per second, is the rate at which a language model produces or consumes tokens during a benchmark.
- On llm-speed, tok/s without further qualification refers to decode tok/s — the streaming-output rate users care about during chat.
- decode tok/s
- decode tok/s is the rate, in tokens per second, at which a model emits new tokens during generation; it is memory-bandwidth-bound and runs one token at a time per stream.
- llm-speed measures decode tok/s as output_tokens divided by decode wall-clock time, at batch size 1 unless the workload (concurrent-decode) says otherwise.
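A minimal sketch of that measurement, assuming a generic iterator that yields tokens as they stream in; the iterator and the function name are illustrative, not llm-speed's actual harness:

```python
import time

def decode_tok_s(stream):
    # `stream` is any iterator yielding generated tokens as they arrive.
    next(stream)                            # first token marks the end of prefill
    t0 = time.perf_counter()
    n = sum(1 for _ in stream)              # tokens produced during pure decode
    decode_wall_s = time.perf_counter() - t0
    return n / decode_wall_s                # output_tokens / decode wall time
```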
- prefill tok/s
- prefill tok/s is the rate, in tokens per second, at which a model ingests the prompt before generation begins; it is compute-bound and parallelizes across the prompt length.
- Prefill tok/s is typically 10x to 100x higher than decode tok/s on the same hardware. A run with a long prompt and high prefill throughput can still show low end-to-end tok/s if decode is the bottleneck.
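Hypothetical numbers make the split concrete (the rates below are invented for illustration):

```python
prompt_toks, prefill_rate = 4096, 2000   # prefill: compute-bound, fast
out_toks, decode_rate = 1024, 40         # decode: bandwidth-bound, slow
prefill_s = prompt_toks / prefill_rate           # ~2.0 s
decode_s = out_toks / decode_rate                # ~25.6 s
end_to_end = out_toks / (prefill_s + decode_s)   # ~37 tok/s
# Even with 50x faster prefill, end-to-end tok/s sits just under decode tok/s.
```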
- TTFT — time-to-first-token
- TTFT, or time-to-first-token, is the wall-clock duration in milliseconds between submitting a prompt and receiving the first generated token.
- On local backends, TTFT is dominated by prefill. On hosted APIs it adds queueing and network round-trip time.
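A sketch of how TTFT can be timed against a streaming backend; `send_request` is a placeholder for whatever submits the prompt and returns a token iterator:

```python
import time

def ttft_ms(send_request):
    t0 = time.perf_counter()            # clock starts at prompt submission
    stream = send_request()
    next(stream)                        # block until the first token arrives
    return (time.perf_counter() - t0) * 1000
```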
- p50
- p50, the 50th percentile, is the median per-token decode latency in milliseconds; half of decoded tokens take less than this and half take longer.
- p95
- p95, the 95th percentile, is the per-token decode latency in milliseconds below which 95% of decoded tokens fall; it captures tail latency that the median hides.
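Both percentiles fall out of the gaps between consecutive token arrival times; a sketch with invented timestamps:

```python
import statistics

arrivals_ms = [0.0, 21.3, 42.9, 64.1, 197.5, 218.8]   # hypothetical
latencies = [b - a for a, b in zip(arrivals_ms, arrivals_ms[1:])]

cuts = statistics.quantiles(latencies, n=100)   # 99 percentile cut points
p50, p95 = cuts[49], cuts[94]
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms")    # the one slow 133 ms gap inflates p95
```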
- batch size
- batch size is the number of independent decode streams a backend runs in parallel; in the chat-short, chat-long, long-context-decay, and agent-trace workloads it is 1, while concurrent-decode sweeps batch 1, 4, 8, and 16.
- context length
- context length is the number of tokens of input the model attends to in a single forward pass; the long-context-decay workload measures how prefill and decode degrade at 32k, 64k, and 128k context.
- prefix cache
- prefix cache, also known as prompt cache, is a backend feature that stores the KV-cache state for a prompt prefix so subsequent requests sharing that prefix skip prefill on the cached portion.
- llm-speed records prefix_cache_hit_rate when the backend exposes it, so agent-trace numbers are interpretable on backends that cache between turns and on backends that do not.
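A toy illustration, with invented token counts, of why caching between turns matters for agent-trace: each turn resends the whole conversation so far, so a perfect prefix cache prefills only the new suffix.

```python
turn_prompt_toks = [1_000, 3_000, 6_000, 16_000]   # cumulative, hypothetical
cold = sum(turn_prompt_toks)                       # no cache: 26,000 tokens prefilled
warm = turn_prompt_toks[-1]                        # perfect cache: 16,000 tokens
# (each turn only extends the previous prompt, so only the suffix is new work)
```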
- prefill
- prefill is the phase of LLM inference in which the model processes the input prompt to populate its KV cache before any output tokens are generated.
- decode
- decode is the phase of LLM inference in which the model autoregressively produces output tokens, one token per forward pass, after prefill has completed.
- KV cache — key/value cache
- the KV cache, short for key/value cache, is the per-layer attention state a transformer accumulates during prefill so that decode can produce each new token without recomputing attention over the full prompt.
- KV cache size grows linearly with context length and is the primary reason decode tok/s falls at long context.
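A back-of-envelope sizing, assuming a Llama-70B-class layout (80 layers, 8 grouped-query KV heads of dimension 128, fp16 cache); the numbers are illustrative and vary by model and backend:

```python
layers, kv_heads, head_dim, bytes_per = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per   # K and V: 320 KiB/token
for ctx in (32_000, 64_000, 128_000):
    print(f"{ctx:>7} ctx: {per_token * ctx / 2**30:5.1f} GiB")
# 32k ~ 9.8 GiB, 64k ~ 19.5 GiB, 128k ~ 39.1 GiB -- linear in context length
```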
- EdDSA — Edwards-curve Digital Signature Algorithm
- EdDSA, the Edwards-curve Digital Signature Algorithm, is the digital-signature scheme llm-speed uses (specifically Ed25519) to sign every benchmark upload so the bytes cannot be tampered with after signing.
- JWS — JSON Web Signature (RFC 7515)
- JWS, JSON Web Signature defined by RFC 7515, is the compact-token format llm-speed wraps each upload in; the public key rides in the protected header so any third party can verify the signature without the server.
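A sketch of third-party verification of a compact JWS, using the `cryptography` package, assuming the Ed25519 public key travels in the protected header as a JWK under `jwk` (one common RFC 7515 convention; llm-speed's exact header layout is not specified in this glossary):

```python
import base64, json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def b64url(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def verify_compact_jws(token: str) -> dict:
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(b64url(header_b64))
    assert header["alg"] == "EdDSA"
    pub = Ed25519PublicKey.from_public_bytes(b64url(header["jwk"]["x"]))
    pub.verify(b64url(sig_b64), f"{header_b64}.{payload_b64}".encode())  # raises if invalid
    return json.loads(b64url(payload_b64))   # the verified payload
```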
- fingerprint
- the llm-speed hardware fingerprint is a SHA-256 hash over bucketed hardware fields (CPU model, core count, RAM rounded to 8 GB, GPU name, VRAM rounded to 8 GB, OS major version) that identifies a hardware class, not a device.
- Two physically different machines with the same SoC, RAM bucket, and major OS produce the same fingerprint hash. The hash is omitted entirely in --strict-anon mode.
- fingerprint hash
- the fingerprint hash is the 16-hex-character prefix of the llm-speed hardware fingerprint, used server-side to group runs of the same hardware class for outlier detection.
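An illustrative reconstruction from the two descriptions above; the real field order, separators, and rounding rules live in the llm-speed CLI and may differ:

```python
import hashlib

def _bucket8(gb: float) -> int:
    return int(round(gb / 8)) * 8           # one plausible 8 GB rounding

def fingerprint_hash(cpu, cores, ram_gb, gpu, vram_gb, os_major):
    fields = f"{cpu}|{cores}|{_bucket8(ram_gb)}|{gpu}|{_bucket8(vram_gb)}|{os_major}"
    full = hashlib.sha256(fields.encode()).hexdigest()   # the fingerprint
    return full[:16]                        # the 16-hex-character hash prefix
```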
- suite-v1
- suite-v1 is the current versioned llm-speed workload suite, comprising chat-short, chat-long, long-context-decay, concurrent-decode, and agent-trace; the suite_version field on every result identifies which protocol produced it.
- Future suite versions (suite-v2, etc.) will not invalidate older results; runs stay accessible and comparable within their own version.
- chat-short
- chat-short (W1) is the baseline llm-speed workload: a 128-token natural-language prompt and a 256-token completion at batch size 1.
- chat-long
- chat-long (W2) is the long-prompt llm-speed workload: a 4,096-token prompt assembled from fixture paragraphs and a 1,024-token completion at batch size 1, designed to surface prefill scaling.
- long-context-decay
- long-context-decay (W3) is the opt-in llm-speed workload that re-runs the same model at 32k, 64k, and 128k input context (where supported), each producing a 256-token completion, to capture how prefill and decode degrade with context length.
- concurrent-decode
- concurrent-decode (W4) is the llm-speed workload that runs multiple simultaneous decode streams at batch sizes 1, 4, 8, and 16, reporting aggregate throughput, per-stream throughput, and p50/p95 latency at each batch.
- Backends without true concurrent decoding (for example MLX, which runs streams serially in-process) are flagged on the result so the number is not misread as parallel throughput.
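The two throughput views relate simply, assuming symmetric streams that run for the full duration (the numbers below are invented):

```python
for batch, aggregate_tok_s in [(1, 42.0), (4, 130.0), (8, 210.0), (16, 290.0)]:
    per_stream = aggregate_tok_s / batch     # the rate each stream experiences
    print(f"batch {batch:2d}: {aggregate_tok_s:6.1f} aggregate, "
          f"{per_stream:5.1f} per stream (tok/s)")
# aggregate climbs with batch size while per-stream tok/s falls -- the
# trade-off this workload exists to quantify
```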
- agent-trace
- agent-trace (W5) is the llm-speed workload that replays a fixed multi-turn coding-agent trace with realistic tool calls (read-file, write-file, run-tests, fix), growing the context to about 16k tokens by the final turn; it measures inference speed under the workload that matters for daily-driver agentic LLM use.
- wall_ms
- wall_ms is the end-to-end wall-clock duration of a workload in milliseconds, from request start to last token received.
- model digest
- the model digest is a prefix of the SHA-256 hash of the weights file used for a benchmark, recorded so that two 'Llama 3.3 70B Q4_K_M' results from different sources are not blended if one used a different quantization pipeline.
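Computing such a digest is a streaming SHA-256 over the weights file; the prefix length recorded by llm-speed is an assumption in this sketch:

```python
import hashlib

def model_digest(path: str, prefix_len: int = 16) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):   # 1 MiB chunks
            h.update(chunk)            # stream: weights files are tens of GB
    return h.hexdigest()[:prefix_len]
```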
- Q4_K_M
- Q4_K_M is a llama.cpp / GGUF quantization scheme that stores most weights at 4 bits with k-quant blocks and selectively keeps important tensors at higher precision; it is a common balance of size and quality for local LLM inference.
- MLX
- MLX is Apple's Metal-native array framework for machine learning on Apple Silicon; llm-speed measures it via the mlx-lm package and reports it under the backend label 'mlx'.
- llama.cpp
- llama.cpp is a C/C++ runtime for transformer inference with first-class Metal, CUDA, ROCm, and CPU backends; llm-speed measures it via the llama-server / llama-cli binary on PATH and reports the build version.
- vLLM
- vLLM is a high-throughput inference server with PagedAttention and continuous batching, primarily targeting CUDA datacenter accelerators; llm-speed measures it via its OpenAI-compatible HTTP server or the vllm Python package.
- ollama
- ollama is a daemon that wraps llama.cpp with a model registry and a local REST API on port 11434; llm-speed measures it as a separate backend from llama.cpp because the surrounding harness affects TTFT.
- exllamav2
- exllamav2 is a CUDA-only inference library specialized for fast decoding of GPTQ/EXL2-quantized weights on consumer NVIDIA GPUs; llm-speed measures it via the exllamav2 Python package.
- hosted-api
- hosted-api is the llm-speed backend label for any OpenAI-compatible HTTP endpoint or the Anthropic API, including OpenAI, OpenRouter, Together, Fireworks, Groq, and Anthropic.
- --strict-anon
- --strict-anon is the llm-speed CLI mode in which a fresh Ed25519 keypair is generated in memory for every run, the fingerprint hash is omitted from the payload, and no User-Agent or X-LLM-Speed-Anon header is sent, so consecutive submissions from the same machine are unlinkable from the server's perspective.
- --anon
- --anon is the llm-speed CLI mode that sends an X-LLM-Speed-Anon: 1 header instructing the server not to associate the upload with any account, while keeping the persistent keypair and fingerprint hash for outlier-cluster grouping.
- --no-upload
- --no-upload is the llm-speed CLI mode that runs the workloads, saves the signed result to ~/.cache/llm-speed/runs/, and makes no network call at all; the result can be re-uploaded later with 'llm-speed bench --resume <path>'.
- outlier flag
- an outlier flag is a server-side marker on a benchmark result whose decode tok/s falls more than 3 sigma from the mean of its (model x hardware x backend) cluster; flagged runs remain visible and may be challenged in a public dispute thread.
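A minimal sketch of the 3-sigma rule over one (model x hardware x backend) cluster; the server's real clustering and any robustness tweaks are not specified here:

```python
import statistics

def outlier_flags(decode_tok_s: list[float]) -> list[bool]:
    mu = statistics.mean(decode_tok_s)
    sigma = statistics.pstdev(decode_tok_s)      # population std dev
    return [abs(x - mu) > 3 * sigma for x in decode_tok_s]
```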
Cite a specific run as llm-speed.com/r/<id>, a per-model page as llm-speed.com/m/<slug>, and a per-hardware page as llm-speed.com/hw/<slug>.