Glossary of llm-speed terms
Every term llm-speed uses when reporting a benchmark, defined in one sentence first and elaborated below. Cross-reference /methodology for workload specs and /privacy for data-handling terms.
- tok/s — tokens per second
- tok/s, short for tokens per second, is the rate at which a language model produces or consumes tokens during a benchmark.
- On llm-speed, tok/s without further qualification refers to decode tok/s — the streaming-output rate users care about during chat.
- decode tok/s
- decode tok/s is the rate, in tokens per second, at which a model emits new tokens during generation; it is memory-bandwidth-bound and runs one token at a time per stream.
- llm-speed measures decode tok/s as output_tokens divided by decode wall-clock time, at batch size 1 unless the workload (concurrent-decode) says otherwise.
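A minimal sketch of that measurement, assuming a generic iterator that yields tokens as they stream in; the iterator and the function name are illustrative, not llm-speed's actual harness:

```python
import time

def decode_tok_s(stream):
    # `stream` is any iterator yielding generated tokens as they arrive.
    next(stream)                            # first token marks the end of prefill
    t0 = time.perf_counter()
    n = sum(1 for _ in stream)              # tokens produced during pure decode
    decode_wall_s = time.perf_counter() - t0
    return n / decode_wall_s                # output_tokens / decode wall time
```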
- prefill tok/s
- prefill tok/s is the rate, in tokens per second, at which a model ingests the prompt before generation begins; it is compute-bound and parallelizes across the prompt length.
- Prefill tok/s is typically 10x to 100x higher than decode tok/s on the same hardware. A run with a long prompt and high prefill throughput can still show low end-to-end tok/s if decode is the bottleneck.
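Hypothetical numbers make the split concrete (the rates below are invented for illustration):

```python
prompt_toks, prefill_rate = 4096, 2000   # prefill: compute-bound, fast
out_toks, decode_rate = 1024, 40         # decode: bandwidth-bound, slow
prefill_s = prompt_toks / prefill_rate           # ~2.0 s
decode_s = out_toks / decode_rate                # ~25.6 s
end_to_end = out_toks / (prefill_s + decode_s)   # ~37 tok/s
# Even with 50x faster prefill, end-to-end tok/s sits just under decode tok/s.
```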
- TTFT — time-to-first-token
- TTFT, or time-to-first-token, is the wall-clock duration in milliseconds between submitting a prompt and receiving the first generated token.
- On local backends, TTFT is dominated by prefill. On hosted APIs it adds queueing and network round-trip time.
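A sketch of how TTFT can be timed against a streaming backend; `send_request` is a placeholder for whatever submits the prompt and returns a token iterator:

```python
import time

def ttft_ms(send_request):
    t0 = time.perf_counter()            # clock starts at prompt submission
    stream = send_request()
    next(stream)                        # block until the first token arrives
    return (time.perf_counter() - t0) * 1000
```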
- p50
- p50, the 50th percentile, is the median per-token decode latency in milliseconds; half of decoded tokens take less than this and half take longer.
- p95
- p95, the 95th percentile, is the per-token decode latency in milliseconds below which 95% of decoded tokens fall; it captures tail latency that the median hides.
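Both percentiles fall out of the gaps between consecutive token arrival times; a sketch with invented timestamps:

```python
import statistics

arrivals_ms = [0.0, 21.3, 42.9, 64.1, 197.5, 218.8]   # hypothetical
latencies = [b - a for a, b in zip(arrivals_ms, arrivals_ms[1:])]

cuts = statistics.quantiles(latencies, n=100)   # 99 percentile cut points
p50, p95 = cuts[49], cuts[94]
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms")    # the one slow 133 ms gap inflates p95
```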
- batch size
- batch size is the number of independent decode streams a backend runs in parallel; in the chat-short, chat-long, long-context-decay, and agent-trace workloads it is 1, while concurrent-decode sweeps batch 1, 4, 8, and 16.
- context length
- context length is the number of tokens of input the model attends to in a single forward pass; the long-context-decay workload measures how prefill and decode degrade at 32k, 64k, and 128k context.
- prefix cache
- prefix cache, also known as prompt cache, is a backend feature that stores the KV-cache state for a prompt prefix so subsequent requests sharing that prefix skip prefill on the cached portion.
- llm-speed records prefix_cache_hit_rate when the backend exposes it, so agent-trace numbers are interpretable on backends that cache between turns and on backends that do not.
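A toy illustration, with invented token counts, of why caching between turns matters for agent-trace: each turn resends the whole conversation so far, so a perfect prefix cache prefills only the new suffix.

```python
turn_prompt_toks = [1_000, 3_000, 6_000, 16_000]   # cumulative, hypothetical
cold = sum(turn_prompt_toks)                       # no cache: 26,000 tokens prefilled
warm = turn_prompt_toks[-1]                        # perfect cache: 16,000 tokens
# (each turn only extends the previous prompt, so only the suffix is new work)
```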
- prefill
- prefill is the phase of LLM inference in which the model processes the input prompt to populate its KV cache before any output tokens are generated.
- decode
- decode is the phase of LLM inference in which the model autoregressively produces output tokens, one token per forward pass, after prefill has completed.
- KV cache — key/value cache
- the KV cache, short for key/value cache, is the per-layer attention state a transformer accumulates during prefill so that decode can produce each new token without recomputing attention over the full prompt.
- KV cache size grows linearly with context length and is the primary reason decode tok/s falls at long context.
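A back-of-envelope sizing, assuming a Llama-70B-class layout (80 layers, 8 grouped-query KV heads of dimension 128, fp16 cache); the numbers are illustrative and vary by model and backend:

```python
layers, kv_heads, head_dim, bytes_per = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per   # K and V: 320 KiB/token
for ctx in (32_000, 64_000, 128_000):
    print(f"{ctx:>7} ctx: {per_token * ctx / 2**30:5.1f} GiB")
# 32k ~ 9.8 GiB, 64k ~ 19.5 GiB, 128k ~ 39.1 GiB -- linear in context length
```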
- EdDSA — Edwards-curve Digital Signature Algorithm
- EdDSA, the Edwards-curve Digital Signature Algorithm, is the digital-signature scheme llm-speed uses (specifically Ed25519) to sign every benchmark upload so the bytes cannot be tampered with after signing.
- JWS — JSON Web Signature (RFC 7515)
- JWS, JSON Web Signature defined by RFC 7515, is the compact-token format llm-speed wraps each upload in; the public key rides in the protected header so any third party can verify the signature without the server.
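A sketch of third-party verification of a compact JWS, using the `cryptography` package, assuming the Ed25519 public key travels in the protected header as a JWK under `jwk` (one common RFC 7515 convention; llm-speed's exact header layout is not specified in this glossary):

```python
import base64, json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def b64url(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def verify_compact_jws(token: str) -> dict:
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(b64url(header_b64))
    assert header["alg"] == "EdDSA"
    pub = Ed25519PublicKey.from_public_bytes(b64url(header["jwk"]["x"]))
    pub.verify(b64url(sig_b64), f"{header_b64}.{payload_b64}".encode())  # raises if invalid
    return json.loads(b64url(payload_b64))   # the verified payload
```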
- fingerprint
- the llm-speed hardware fingerprint is a SHA-256 hash over bucketed hardware fields (CPU model, core count, RAM rounded to 8 GB, GPU name, VRAM rounded to 8 GB, OS major version) that identifies a hardware class, not a device.
- Two physically different machines with the same SoC, RAM bucket, and major OS produce the same fingerprint hash. The hash is omitted entirely in --strict-anon mode.
- fingerprint hash
- the fingerprint hash is the 16-hex-character prefix of the llm-speed hardware fingerprint, used server-side to group runs of the same hardware class for outlier detection.
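An illustrative reconstruction from the two descriptions above; the real field order, separators, and rounding rules live in the llm-speed CLI and may differ:

```python
import hashlib

def _bucket8(gb: float) -> int:
    return int(round(gb / 8)) * 8           # one plausible 8 GB rounding

def fingerprint_hash(cpu, cores, ram_gb, gpu, vram_gb, os_major):
    fields = f"{cpu}|{cores}|{_bucket8(ram_gb)}|{gpu}|{_bucket8(vram_gb)}|{os_major}"
    full = hashlib.sha256(fields.encode()).hexdigest()   # the fingerprint
    return full[:16]                        # the 16-hex-character hash prefix
```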
- suite-v1
- suite-v1 is the current versioned llm-speed workload suite, comprising chat-short, chat-long, long-context-decay, concurrent-decode, and agent-trace; the suite_version field on every result identifies which protocol produced it.
- Future suite versions (suite-v2, etc.) will not invalidate older results; runs stay accessible and comparable within their own version.
- chat-short
- chat-short (W1) is the baseline llm-speed workload: a 128-token natural-language prompt and a 256-token completion at batch size 1.
- chat-long
- chat-long (W2) is the long-prompt llm-speed workload: a 4,096-token prompt assembled from fixture paragraphs and a 1,024-token completion at batch size 1, designed to surface prefill scaling.
- long-context-decay
- long-context-decay (W3) is the opt-in llm-speed workload that re-runs the same model at 32k, 64k, and 128k input context (where supported), each producing a 256-token completion, to capture how prefill and decode degrade with context length.
- concurrent-decode
- concurrent-decode (W4) is the llm-speed workload that runs multiple simultaneous decode streams at batch sizes 1, 4, 8, and 16, reporting aggregate throughput, per-stream throughput, and p50/p95 latency at each batch.
- Backends without true concurrent decoding (for example MLX, which runs streams serially in-process) are flagged on the result so the number is not misread as parallel throughput.
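The two throughput views relate simply, assuming symmetric streams that run for the full duration (the numbers below are invented):

```python
for batch, aggregate_tok_s in [(1, 42.0), (4, 130.0), (8, 210.0), (16, 290.0)]:
    per_stream = aggregate_tok_s / batch     # the rate each stream experiences
    print(f"batch {batch:2d}: {aggregate_tok_s:6.1f} aggregate, "
          f"{per_stream:5.1f} per stream (tok/s)")
# aggregate climbs with batch size while per-stream tok/s falls -- the
# trade-off this workload exists to quantify
```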
- agent-trace
- agent-trace (W5) is the llm-speed workload that replays a fixed multi-turn coding-agent trace with realistic tool calls (read-file, write-file, run-tests, fix), growing the context to about 16k tokens by the final turn; it measures inference speed under the workload that matters for daily-driver agentic LLM use.
- wall_ms
- wall_ms is the end-to-end wall-clock duration of a workload in milliseconds, from request start to last token received.
- model digest
- the model digest is a prefix of the SHA-256 hash of the weights file used for a benchmark, recorded so that two 'Llama 3.3 70B Q4_K_M' results from different sources are not blended if one used a different quantization pipeline.
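Computing such a digest is a streaming SHA-256 over the weights file; the prefix length recorded by llm-speed is an assumption in this sketch:

```python
import hashlib

def model_digest(path: str, prefix_len: int = 16) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):   # 1 MiB chunks
            h.update(chunk)            # stream: weights files are tens of GB
    return h.hexdigest()[:prefix_len]
```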
- Q4_K_M
- Q4_K_M is a llama.cpp / GGUF quantization scheme that stores most weights at 4 bits with k-quant blocks and selectively keeps important tensors at higher precision; it is a common balance of size and quality for local LLM inference.
- MLX
- MLX is Apple's Metal-native array framework for machine learning on Apple Silicon; llm-speed measures it via the mlx-lm package and reports it under the backend label 'mlx'.
- llama.cpp
- llama.cpp is a C/C++ runtime for transformer inference with first-class Metal, CUDA, ROCm, and CPU backends; llm-speed measures it via the llama-server / llama-cli binary on PATH and reports the build version.
- vLLM
- vLLM is a high-throughput inference server with PagedAttention and continuous batching, primarily targeting CUDA datacenter accelerators; llm-speed measures it via its OpenAI-compatible HTTP server or the vllm Python package.
- ollama
- ollama is a daemon that wraps llama.cpp with a model registry and a local REST API on port 11434; llm-speed measures it as a separate backend from llama.cpp because the surrounding harness affects TTFT.
- exllamav2
- exllamav2 is a CUDA-only inference library specialized for fast decoding of GPTQ/EXL2-quantized weights on consumer NVIDIA GPUs; llm-speed measures it via the exllamav2 Python package.
- hosted-api
- hosted-api is the llm-speed backend label for any OpenAI-compatible HTTP endpoint or the Anthropic API, including OpenAI, OpenRouter, Together, Fireworks, Groq, and Anthropic.
- --strict-anon
- --strict-anon is the llm-speed CLI mode in which a fresh Ed25519 keypair is generated in memory for every run, the fingerprint hash is omitted from the payload, and no User-Agent or X-LLM-Speed-Anon header is sent, so consecutive submissions from the same machine are unlinkable from the server's perspective.
- --anon
- --anon is the llm-speed CLI mode that sends an X-LLM-Speed-Anon: 1 header instructing the server not to associate the upload with any account, while keeping the persistent keypair and fingerprint hash for outlier-cluster grouping.
- --no-upload
- --no-upload is the llm-speed CLI mode that runs the workloads, saves the signed result to ~/.cache/llm-speed/runs/, and makes no network call at all; the result can be re-uploaded later with 'llm-speed bench --resume <path>'.
- outlier flag
- an outlier flag is a server-side marker on a benchmark result whose decode tok/s falls more than 3 sigma from the mean of its (model x hardware x backend) cluster; flagged runs remain visible and may be challenged in a public dispute thread.
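A minimal sketch of the 3-sigma rule over one (model x hardware x backend) cluster; the server's real clustering and any robustness tweaks are not specified here:

```python
import statistics

def outlier_flags(decode_tok_s: list[float]) -> list[bool]:
    mu = statistics.mean(decode_tok_s)
    sigma = statistics.pstdev(decode_tok_s)      # population std dev
    return [abs(x - mu) > 3 * sigma for x in decode_tok_s]
```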
Cite a specific run as llm-speed.com/r/<id>, a per-model page as llm-speed.com/m/<slug>, and a per-hardware page as llm-speed.com/hw/<slug>.