Glossary of llm-speed terms

Every term llm-speed uses when reporting a benchmark, defined in one sentence first and elaborated below. Cross-reference /methodology for workload specs and /privacy for data-handling terms.

tok/s (tokens per second)
tok/s, short for tokens per second, is the rate at which a language model produces or consumes tokens during a benchmark.
On llm-speed, tok/s without further qualification refers to decode tok/s — the streaming-output rate users care about during chat.
decode tok/s
decode tok/s is the rate, in tokens per second, at which a model emits new tokens during generation; it is memory-bandwidth-bound and runs one token at a time per stream.
llm-speed measures decode tok/s as output_tokens divided by decode wall-clock time, at batch size 1 unless the workload (concurrent-decode) says otherwise.
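As a minimal sketch of that arithmetic (the names below are illustrative, not llm-speed's actual field names):

```python
def decode_tok_per_s(output_tokens: int, decode_wall_s: float) -> float:
    """Decode throughput = generated tokens / decode wall-clock seconds."""
    return output_tokens / decode_wall_s

# 256 generated tokens streamed over 8.0 s of decode time -> 32.0 tok/s
print(decode_tok_per_s(256, 8.0))
```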
prefill tok/s
prefill tok/s is the rate, in tokens per second, at which a model ingests the prompt before generation begins; it is compute-bound and parallelizes across the prompt length.
Prefill tok/s is typically 10x to 100x higher than decode tok/s on the same hardware, so a long-prompt run can show high prefill throughput yet low end-to-end tok/s when decode is the bottleneck.
TTFT (time-to-first-token)
TTFT, or time-to-first-token, is the wall-clock duration in milliseconds between submitting a prompt and receiving the first generated token.
On local backends, TTFT is dominated by prefill. On hosted APIs it adds queueing and network round-trip time.
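A sketch of how TTFT and prefill tok/s relate on a local backend, where TTFT is essentially the prefill time (names illustrative):

```python
def ttft_ms(request_sent_s: float, first_token_s: float) -> float:
    """Time-to-first-token: wall-clock milliseconds from prompt submission
    to the first generated token."""
    return (first_token_s - request_sent_s) * 1000.0

def prefill_tok_per_s(prompt_tokens: int, request_sent_s: float, first_token_s: float) -> float:
    """On a local backend TTFT is dominated by prefill, so prompt tokens
    divided by TTFT approximates prefill throughput."""
    return prompt_tokens / (first_token_s - request_sent_s)

# 4,096-token prompt, first token 1.6 s after submission
# -> TTFT 1600 ms, prefill roughly 2,560 tok/s
print(ttft_ms(10.0, 11.6), prefill_tok_per_s(4096, 10.0, 11.6))
```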
p50
p50, the 50th percentile, is the median per-token decode latency in milliseconds; half of decoded tokens take less than this and half take longer.
p95
p95, the 95th percentile, is the per-token decode latency in milliseconds below which 95% of decoded tokens fall; it captures tail latency that the median hides.
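A sketch of computing both percentiles from a list of per-token decode latencies; the interpolation method here is Python's statistics module, which may differ slightly from what llm-speed reports:

```python
import statistics

def latency_percentiles(inter_token_ms: list[float]) -> tuple[float, float]:
    """Return (p50, p95) of per-token decode latencies in milliseconds."""
    cuts = statistics.quantiles(inter_token_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]  # 50th and 95th percentile cut points

p50, p95 = latency_percentiles([28.0, 30.5, 29.1, 31.0, 95.2, 28.7, 30.0, 29.5])
```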
batch size
batch size is the number of independent decode streams a backend runs in parallel; in the chat-short, chat-long, long-context-decay, and agent-trace workloads it is 1, while concurrent-decode sweeps batch 1, 4, 8, and 16.
context length
context length is the number of tokens of input the model attends to in a single forward pass; the long-context-decay workload measures how prefill and decode degrade at 32k, 64k, and 128k context.
prefix cache
prefix cache, also known as prompt cache, is a backend feature that stores the KV-cache state for a prompt prefix so subsequent requests sharing that prefix skip prefill on the cached portion.
llm-speed records prefix_cache_hit_rate when the backend exposes it, so agent-trace numbers are interpretable on backends that cache between turns and on backends that do not.
prefill
prefill is the phase of LLM inference in which the model processes the input prompt to populate its KV cache before any output tokens are generated.
decode
decode is the phase of LLM inference in which the model autoregressively produces output tokens, one token per forward pass, after prefill has completed.
KV cache (key/value cache)
the KV cache, short for key/value cache, is the per-layer attention state a transformer accumulates during prefill so that decode can produce each new token without recomputing attention over the full prompt.
KV cache size grows linearly with context length and is the primary reason decode tok/s falls at long context.
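A back-of-the-envelope estimate of that growth, assuming fp16/bf16 KV entries and an illustrative model shape (the numbers are not taken from any llm-speed result):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Rough KV-cache size: 2 (K and V) x layers x KV heads x head dim
    x context tokens x bytes per element (2 for fp16/bf16)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# an 80-layer model with 8 KV heads and head dim 128 at 32k context -> 10 GiB
print(kv_cache_bytes(80, 8, 128, 32_768) / 2**30, "GiB")
```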
EdDSA (Edwards-curve Digital Signature Algorithm)
EdDSA, the Edwards-curve Digital Signature Algorithm, is the digital-signature scheme llm-speed uses (specifically Ed25519) to sign every benchmark upload so the bytes cannot be tampered with after signing.
JWS (JSON Web Signature, RFC 7515)
JWS, JSON Web Signature defined by RFC 7515, is the compact-token format llm-speed wraps each upload in; the public key rides in the protected header so any third party can verify the signature without the server.
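A sketch of the third-party verification the entry describes, assuming the protected header carries the Ed25519 public key as a standard OKP JWK under 'jwk' (that exact header layout is an assumption, not confirmed by this glossary); it uses the cryptography package:

```python
import base64, json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def verify_compact_jws(token: str) -> dict:
    """Verify an EdDSA-signed compact JWS (RFC 7515) whose protected header
    embeds the public key, and return the decoded payload."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    assert header.get("alg") == "EdDSA"
    pub = Ed25519PublicKey.from_public_bytes(b64url_decode(header["jwk"]["x"]))
    # The signature covers the ASCII signing input "<header>.<payload>".
    pub.verify(b64url_decode(sig_b64), f"{header_b64}.{payload_b64}".encode("ascii"))
    return json.loads(b64url_decode(payload_b64))
```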
fingerprint
the llm-speed hardware fingerprint is a SHA-256 hash over bucketed hardware fields (CPU model, core count, RAM rounded to 8 GB, GPU name, VRAM rounded to 8 GB, OS major version) that identifies a hardware class, not a device.
Two physically different machines with the same SoC, RAM bucket, and major OS produce the same fingerprint hash. The hash is omitted entirely in --strict-anon mode.
fingerprint hash
the fingerprint hash is the 16-hex-character prefix of the llm-speed hardware fingerprint, used server-side to group runs of the same hardware class for outlier detection.
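An illustrative reconstruction of that hashing; the field order, separators, and rounding direction below are assumptions, not llm-speed's actual encoding:

```python
import hashlib

def fingerprint_hash(cpu_model: str, core_count: int, ram_gb: int,
                     gpu_name: str, vram_gb: int, os_major: str) -> str:
    """SHA-256 over bucketed hardware fields, truncated to a 16-hex prefix."""
    bucket = lambda gb: round(gb / 8) * 8  # 8 GB buckets (rounding mode assumed)
    fields = [cpu_model, str(core_count), str(bucket(ram_gb)),
              gpu_name, str(bucket(vram_gb)), os_major]
    return hashlib.sha256("|".join(fields).encode()).hexdigest()[:16]

print(fingerprint_hash("Apple M3 Max", 16, 64, "Apple M3 Max GPU", 64, "macOS 15"))
```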
suite-v1
suite-v1 is the current versioned llm-speed workload suite, comprising chat-short, chat-long, long-context-decay, concurrent-decode, and agent-trace; the suite_version field on every result identifies which protocol produced it.
Future suite versions (suite-v2, etc.) will not invalidate older results; runs stay accessible and comparable within their own version.
chat-short
chat-short (W1) is the baseline llm-speed workload: a 128-token natural-language prompt and a 256-token completion at batch size 1.
chat-long
chat-long (W2) is the long-prompt llm-speed workload: a 4,096-token prompt assembled from fixture paragraphs and a 1,024-token completion at batch size 1, designed to surface prefill scaling.
long-context-decay
long-context-decay (W3) is the opt-in llm-speed workload that re-runs the same model at 32k, 64k, and 128k input context (where supported), each producing a 256-token completion, to capture how prefill and decode degrade with context length.
concurrent-decode
concurrent-decode (W4) is the llm-speed workload that runs multiple simultaneous decode streams at batch sizes 1, 4, 8, and 16, reporting aggregate throughput, per-stream throughput, and p50/p95 latency at each batch.
Backends without true concurrent decoding (for example MLX, which runs serially in-process) are flagged on the result so the number is not misread as parallel throughput.
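A sketch of the aggregate-versus-per-stream arithmetic at one batch setting; how llm-speed attributes tokens to individual streams is not specified here, so this simply divides evenly:

```python
def concurrent_decode_summary(stream_tokens: list[int], wall_s: float) -> tuple[float, float]:
    """Aggregate tok/s across all streams and mean per-stream tok/s."""
    aggregate = sum(stream_tokens) / wall_s
    return aggregate, aggregate / len(stream_tokens)

# batch 8, each stream decoding 256 tokens over a 16 s wall clock
# -> 128 tok/s aggregate, 16 tok/s per stream
print(concurrent_decode_summary([256] * 8, 16.0))
```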
agent-trace
agent-trace (W5) is the llm-speed workload that replays a fixed multi-turn coding-agent trace with realistic tool calls (read-file, write-file, run-tests, fix), growing the context to about 16k tokens by the final turn, to measure inference speed under the workload that matters for daily-driver agentic LLM use.
wall_ms
wall_ms is the end-to-end wall-clock duration of a workload in milliseconds, from request start to last token received.
model digest
the model digest is a prefix of the SHA-256 hash of the weights file used for a benchmark, recorded so that two 'Llama 3.3 70B Q4_K_M' results from different sources are not blended if one used a different quantization pipeline.
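A minimal sketch of producing such a digest; the prefix length shown is an assumption:

```python
import hashlib

def model_digest(weights_path: str, prefix_len: int = 16) -> str:
    """Stream the weights file through SHA-256 and keep a short hex prefix."""
    h = hashlib.sha256()
    with open(weights_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:prefix_len]
```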
Q4_K_M
Q4_K_M is a llama.cpp / GGUF quantization scheme that stores most weights at 4 bits with k-quant blocks and selectively keeps important tensors at higher precision; it is a common balance of size and quality for local LLM inference.
MLX
MLX is Apple's Metal-native array framework for machine learning on Apple Silicon; llm-speed measures it via the mlx-lm package and reports it under the backend label 'mlx'.
llama.cpp
llama.cpp is a C/C++ runtime for transformer inference with first-class Metal, CUDA, ROCm, and CPU backends; llm-speed measures it via the llama-server / llama-cli binary on PATH and reports the build version.
vLLM
vLLM is a high-throughput inference server with PagedAttention and continuous batching, primarily targeting CUDA datacenter accelerators; llm-speed measures it via its OpenAI-compatible HTTP server or the vllm Python package.
ollama
ollama is a daemon that wraps llama.cpp with a model registry and a local REST API on port 11434; llm-speed measures it as a separate backend from llama.cpp because the surrounding harness affects TTFT.
exllamav2
exllamav2 is a CUDA-only inference library specialized for fast decoding of GPTQ/EXL2-quantized weights on consumer NVIDIA GPUs; llm-speed measures it via the exllamav2 Python package.
hosted-api
hosted-api is the llm-speed backend label for any OpenAI-compatible HTTP endpoint or the Anthropic API, including OpenAI, OpenRouter, Together, Fireworks, Groq, and Anthropic.
--strict-anon
--strict-anon is the llm-speed CLI mode in which a fresh Ed25519 keypair is generated in memory for every run, the fingerprint hash is omitted from the payload, and no User-Agent or X-LLM-Speed-Anon header is sent, so consecutive submissions from the same machine are unlinkable from the server's perspective.
--anon
--anon is the llm-speed CLI mode that sends an X-LLM-Speed-Anon: 1 header instructing the server not to associate the upload with any account, while keeping the persistent keypair and fingerprint hash for outlier-cluster grouping.
--no-upload
--no-upload is the llm-speed CLI mode that runs the workloads, saves the signed result to ~/.cache/llm-speed/runs/, and makes no network call at all; the result can be re-uploaded later with 'llm-speed bench --resume <path>'.
outlier flag
an outlier flag is a server-side marker on a benchmark result whose decode tok/s falls more than 3 sigma from the mean of its (model x hardware x backend) cluster; flagged runs remain visible and may be challenged in a public dispute thread.
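A sketch of the 3-sigma rule as stated; whether the server uses the population or sample standard deviation is an assumption:

```python
import statistics

def is_outlier(decode_tok_s: float, cluster: list[float]) -> bool:
    """Flag a run more than 3 standard deviations from the mean decode tok/s
    of its (model x hardware x backend) cluster."""
    mean = statistics.fmean(cluster)
    sigma = statistics.pstdev(cluster)
    return sigma > 0 and abs(decode_tok_s - mean) > 3 * sigma

print(is_outlier(12.0, [31.5, 32.0, 30.8, 33.1, 31.9]))  # True: far below the cluster mean
```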

Cite a specific run as llm-speed.com/r/<id>, a per-model page as llm-speed.com/m/<slug>, and a per-hardware page as llm-speed.com/hw/<slug>.