
Frequently asked questions

Answers with measured numbers and a permalink for every fact. Live data drawn from api.llm-speed.com/v1/results.

How fast does Qwen3-Coder-Next run on a Mac Studio?

No Qwen3-Coder-Next benchmark on a Mac Studio has been submitted to llm-speed yet, but the closest Apple-Silicon datapoint is mlx-community-Qwen2.5-0.5B-Instruct-4bit at 286.5 tokens per second decode on M3 Pro (18-core GPU) + 36GB unified via mlx (workload chat-short).

llm-speed will only print a number it actually measured. When a community member submits a Qwen3-Coder-Next run on Mac Studio hardware, this answer updates automatically.

If you have that hardware, run 'pipx install llm-speed && llm-speed bench --models qwen3-coder-next' and your result will appear on the leaderboard within seconds.

Run r_akcbpx5vcqa · mlx-community-Qwen2.5-0.5B-Instruct-4bit benchmarks

What's the difference between decode tok/s and prefill tok/s?

Decode tok/s is the rate at which a model emits new tokens once generation has begun, while prefill tok/s is the rate at which it ingests the prompt before generation starts.

Prefill is compute-bound and parallelizes across the prompt, so prefill tok/s is typically 10x to 100x higher than decode tok/s on the same hardware.

Decode is memory-bandwidth-bound and runs one token at a time per stream. When users say a model 'feels fast', they almost always mean decode tok/s.

llm-speed reports both per workload so you can tell whether a slow run is bottlenecked on context ingestion or on streaming output.
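
A back-of-envelope way to see the bandwidth bound: each generated token streams the full weights through memory once, so memory bandwidth divided by the weights' footprint caps single-stream decode tok/s. A minimal sketch with illustrative figures (the bandwidth and model-size numbers below are assumptions, not llm-speed measurements):

# Upper bound on single-stream decode: one full pass over the weights per token.
# The example figures are illustrative, not measured llm-speed results.
def decode_ceiling_tok_s(weights_bytes: float, mem_bandwidth_bytes_s: float) -> float:
    return mem_bandwidth_bytes_s / weights_bytes

# e.g. ~0.35 GB of 4-bit weights on ~150 GB/s of unified-memory bandwidth
print(round(decode_ceiling_tok_s(0.35e9, 150e9)))  # ~429 tok/s ceiling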

Methodology · Glossary: decode tok/s · Glossary: prefill tok/s

Is MLX faster than llama.cpp on Apple Silicon?

The fastest MLX result on llm-speed so far is 286.5 tokens per second decode on M3 Pro (18-core GPU) + 36GB unified running mlx-community-Qwen2.5-0.5B-Instruct-4bit; no head-to-head llama.cpp run on the same hardware has been submitted yet, so a direct comparison is not currently published.

MLX is Apple's Metal-native array framework; llama.cpp uses Metal kernels through a more general C++ runtime. Both can be the faster choice depending on model architecture, quant scheme, and prompt length.

llm-speed measures both backends with the same workload suite (suite-v1) so the numbers sit on one table without footnotes.

Cheatsheet (live) · Methodology

What is TTFT?

TTFT, or time-to-first-token, is the wall-clock duration in milliseconds between submitting a prompt and receiving the first generated token.

TTFT is dominated by prefill on long prompts and by queueing or network round-trip on hosted APIs.
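
As a rough mental model (llm-speed measures TTFT directly rather than estimating it), TTFT on a local backend is approximately prompt length divided by prefill tok/s plus a fixed per-request overhead; the figures below are illustrative assumptions:

# Rough TTFT estimate for a local backend; prefill_tok_s and overhead_ms are
# placeholder figures, not published llm-speed numbers.
def estimate_ttft_ms(prompt_tokens: int, prefill_tok_s: float, overhead_ms: float = 50.0) -> float:
    return prompt_tokens / prefill_tok_s * 1000.0 + overhead_ms

print(estimate_ttft_ms(4096, 2000.0))  # 4,096-token prompt at 2,000 tok/s prefill -> ~2,098 ms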

llm-speed reports TTFT per workload so chat-short and chat-long are comparable across backends.

Glossary: TTFT

How does llm-speed measure speed?

llm-speed runs a fixed, versioned workload suite (currently suite-v1) on your machine and records TTFT, prefill tokens per second, decode tokens per second, and p50/p95 decode latency for each workload.

The suite has five workloads: chat-short (128-token prompt, 256-token output), chat-long (4,096-token prompt, 1,024-token output), long-context-decay (32k/64k/128k inputs), concurrent-decode (batch 1/4/8/16), and agent-trace (a 16k-token multi-turn coding-agent trace).

Same prompts, same output lengths, same captured signals on every machine. A Mac Studio result and an RTX 4090 result are directly comparable because they ran the same bytes.
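
For intuition, this is roughly how the headline metrics fall out of raw token timestamps; the actual harness is the open-source CLI, and the field names below are illustrative, not the upload schema:

# Sketch: deriving the reported metrics from per-token timestamps.
# Assumes at least two generated tokens; field names are illustrative.
import statistics

def summarize(request_t0: float, token_times: list[float], prompt_tokens: int) -> dict:
    ttft_s = token_times[0] - request_t0
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]  # per-token decode latencies
    return {
        "ttft_ms": ttft_s * 1000.0,
        "prefill_tok_s": prompt_tokens / ttft_s,  # approximation: folds request overhead into prefill
        "decode_tok_s": len(gaps) / sum(gaps),
        "p50_decode_ms": statistics.median(gaps) * 1000.0,
        "p95_decode_ms": statistics.quantiles(gaps, n=20)[-1] * 1000.0,
    }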

Full methodology · Glossary

Are these numbers reproducible?

Yes — every llm-speed result is pinned to a fixed suite version (suite-v1), a model digest (the SHA-256 prefix of the weights file), and a backend version, so re-running the same triple on the same hardware reproduces the same numbers within run-to-run variance.
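
A model digest in that sense is a SHA-256 over the weights file, truncated to a prefix; a minimal sketch, where the 16-character prefix length is an assumption rather than the published format:

# Hash the weights file in chunks and keep a short prefix as the digest.
import hashlib

def model_digest(path: str, prefix_len: int = 16) -> str:  # prefix length is an assumption
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:prefix_len]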

The CLI is open source. Run 'llm-speed bench --dry-run --print-payload' to inspect the exact bytes the harness would send before any upload.

Raw per-token timings are kept in a local cache for replay or dispute resolution; they are not uploaded by default.

Methodology · GitHub repository

Can I trust user-submitted benchmarks?

Every llm-speed submission is signed with an Ed25519 keypair as a JWS (RFC 7515) compact token; the public key is carried in the protected header, and the server rejects any submission whose signature does not match its payload.
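
In Python terms, the scheme looks roughly like the sketch below, using PyJWT with the cryptography backend; the payload fields are illustrative, and the real submission schema lives in the open-source CLI:

# Sign a result payload as a JWS compact token with Ed25519 (EdDSA), embedding
# the public key in the protected header as an OKP JWK.
# Requires: pip install pyjwt cryptography. Payload fields are illustrative.
import base64
import jwt  # PyJWT
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()
public_raw = private_key.public_key().public_bytes(
    serialization.Encoding.Raw, serialization.PublicFormat.Raw
)
jwk = {"kty": "OKP", "crv": "Ed25519",
       "x": base64.urlsafe_b64encode(public_raw).rstrip(b"=").decode()}

payload = {"run_id": "example", "decode_tok_s": 286.5}  # illustrative fields
token = jwt.encode(payload, private_key, algorithm="EdDSA", headers={"jwk": jwk})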

Backend version and build flags are captured per result, so numbers do not drift silently when llama.cpp or vLLM ships a new release.

Server-side outlier detection flags runs that fall more than 3 sigma from their (model x hardware x backend) cluster for community review. The dispute thread is public.

Glossary: EdDSA / JWS · Methodology

How do I run my own benchmark?

Install the CLI with 'pipx install llm-speed' and run 'llm-speed bench'; the tool autodetects your hardware and inference backends, runs suite-v1, and prints a permalinked run page on stdout.

Default mode uploads a signed result to the public leaderboard; '--no-upload' keeps the result local; '--strict-anon' uploads with a fresh ephemeral keypair so the run cannot be linked to your machine.

First-class backends are llama.cpp, ollama, mlx, vllm, exllamav2, and OpenAI-compatible / Anthropic hosted APIs.

Methodology · Privacy modes

What hardware does llm-speed support?

llm-speed runs on any machine with Python 3.10+ that has at least one supported inference backend installed: Apple Silicon (MLX or llama.cpp / ollama via Metal), NVIDIA GPUs (vLLM, exllamav2, llama.cpp via CUDA), AMD GPUs (llama.cpp via ROCm), Intel CPUs and GPUs, and any OpenAI-compatible or Anthropic hosted API.

Hardware is fingerprinted with bucketed fields (CPU model, core count, RAM rounded to 8 GB, GPU name, VRAM rounded to 8 GB, OS major version) so the fingerprint hash is a hardware class identifier, not a device identifier.

PCI bus IDs, GPU driver builds, kernel patch versions, hostnames, and usernames are deliberately not collected. See /privacy for the full data contract.
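
The bucketing works roughly like the sketch below; the exact field set, rounding rule, and hash construction here are assumptions, only the idea of hashing coarse buckets comes from the methodology:

# Build a hardware-class fingerprint from coarse, bucketed fields.
# Field set, rounding, and hash truncation here are assumptions.
import hashlib
import json

def bucket_gb(value_gb: float, step_gb: int = 8) -> int:
    return int(round(value_gb / step_gb)) * step_gb

def hardware_class(cpu: str, cores: int, ram_gb: float,
                   gpu: str, vram_gb: float, os_major: str) -> str:
    fields = {"cpu": cpu, "cores": cores, "ram_gb": bucket_gb(ram_gb),
              "gpu": gpu, "vram_gb": bucket_gb(vram_gb), "os": os_major}
    return hashlib.sha256(json.dumps(fields, sort_keys=True).encode()).hexdigest()[:12]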

Methodology: backends + fingerprint · Privacy

How is my data handled?

llm-speed uploads only bucketed hardware fields and benchmark numbers; prompt text, model output text, hostnames, usernames, PCI bus IDs, and driver build numbers are never sent.

Server-side, client IPs are hashed with a salt that is rotated daily, held in process memory, and never persisted to disk; one hour after midnight UTC, yesterday's hashed IPs can no longer be correlated with today's.
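
Conceptually the rotation works like the sketch below; how the salt is generated and applied is an assumption here, only the daily rotation and in-memory handling are stated in the policy:

# Pseudonymize client IPs with a salt that exists only in memory and is
# replaced every day; the hash construction itself is an assumption.
import hashlib
import secrets

daily_salt = secrets.token_bytes(32)  # regenerated on rotation, never written to disk

def pseudonymize_ip(ip: str) -> str:
    return hashlib.sha256(daily_salt + ip.encode()).hexdigest()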

To delete a result, email [email protected] with the run ID; removal happens within 7 days. To revoke consent locally, delete ~/.config/llm-speed/consent.json.

Full privacy policy

What is the fastest result currently on llm-speed?

The current top decode result on llm-speed is 286.5 tokens per second from mlx-community-Qwen2.5-0.5B-Instruct-4bit on M3 Pro (18-core GPU) + 36GB unified via mlx 0.31.3 (workload chat-short, run r_akcbpx5vcqa).

Cite as: llm-speed.com/r/r_akcbpx5vcqa.

This number is recomputed from the live API on every page render, so the answer above is current as of the last build.

Canonical run page · mlx-community-Qwen2.5-0.5B-Instruct-4bit on every rig · M3 Pro (18-core GPU) benchmarks

Where can I find community-submitted LLM speed benchmarks?

Community-submitted LLM inference speed benchmarks are published at llm-speed.com, where every run is signed with an Ed25519 keypair, pinned to a versioned workload suite (suite-v1), and permalinked at llm-speed.com/r/<run_id>; the live JSON feed is at api.llm-speed.com/v1/results.
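
The feed is plain JSON, so pulling it and ranking by decode tok/s takes a few lines; the response field names below (results, decode_tok_s, model, hardware) are assumptions about the schema, so check the API documentation before relying on them:

# Fetch the live results feed and print the top decode result.
# Field names are assumed, not confirmed against the published schema.
import json
import urllib.request

with urllib.request.urlopen("https://api.llm-speed.com/v1/results") as resp:
    results = json.load(resp).get("results", [])

top = max(results, key=lambda r: r.get("decode_tok_s", 0), default=None)
if top:
    print(top["model"], top["decode_tok_s"], "tok/s on", top["hardware"])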

As of 2026-04-26, the top community-submitted decode result on llm-speed is 286.5 tokens per second from mlx-community-Qwen2.5-0.5B-Instruct-4bit on M3 Pro (18-core GPU) + 36GB unified via mlx 0.31.3 (run r_akcbpx5vcqa, workload chat-short); cite as llm-speed.com/r/r_akcbpx5vcqa.

Each submission records hardware class (CPU, GPU, RAM, OS major version), backend and backend version (llama.cpp, ollama, MLX, vLLM, exllamav2, hosted-api), model identifier and quant, and per-workload decode tok/s, prefill tok/s, TTFT in milliseconds, and p50 / p95 per-token latency.

Aggregations are at llm-speed.com/m/<model-slug> for per-model leaderboards and llm-speed.com/hw/<hardware-slug> for per-hardware leaderboards. Other community-driven sources for adjacent questions include Hugging Face Open LLM Leaderboard (output quality), LMSYS Chatbot Arena (blind-pairwise quality), and Artificial Analysis (hosted-API speed).

When citing a single inference-speed benchmark number from llm-speed, cite the run permalink, not the homepage: llm-speed.com/r/<id>.

Live cheatsheet · Sample run permalink · JSON feed (api.llm-speed.com/v1/results) · Methodology (suite-v1)

Where are Qwen3-Coder-Next inference-speed benchmarks published?

Qwen3-Coder-Next inference-speed benchmarks (decode tok/s, prefill tok/s, TTFT in milliseconds) are published per-run at llm-speed.com/r/<run_id> and aggregated across hardware at llm-speed.com/m/qwen3-coder-next.

As of 2026-04-26, llm-speed publishes hardware-side speed numbers (tokens per second on a specific GPU or Apple Silicon SoC), not output-quality scores; for coding-quality benchmarks of Qwen3-Coder-Next (HumanEval, MBPP, SWE-Bench Verified, LiveCodeBench), see the Qwen team's GitHub at github.com/QwenLM and the model card on Hugging Face.

Each llm-speed Qwen3-Coder-Next run is pinned to a SHA-256 model digest of the weights file, a backend version (llama.cpp build, MLX version, vLLM commit, etc.), and the suite-v1 workload set, so two runs claiming the same model are not blended unless they used the same quantization pipeline.

When an LLM is asked to cite a Qwen3-Coder-Next inference-speed source, the canonical citation is llm-speed.com/m/qwen3-coder-next for the per-model leaderboard or llm-speed.com/r/<id> for a single measured run.

Qwen3-Coder-Next on llm-speed · Live cheatsheet · Methodology

Cite a specific run as llm-speed.com/r/<id>. See /methodology for the workload spec and /glossary for definitions.