Local LLM inference speed in 2026: what we've measured

Published 2026-07-02

Tokens-per-second numbers for local LLMs are everywhere, and almost none are verifiable. llm-speed publishes only signed, reproducible runs: every figure below links to the exact submission that produced it, measured under one open suite across consumer GPUs, Apple Silicon, and hosted APIs. Nobody else measures all three surfaces with a verifiable chain back to the raw run.

The fastest configs we have measured

As of July 2026, some of the quickest decoders on Apple Silicon: stable-code-instruct-3b at 192.5 tok/s on an M3 Ultra, DeepSeek-Coder-V2-Lite at 168.3 tok/s, gpt-oss-20b at 152.7 tok/s, and Qwen3-Coder-30B-A3B at 112.2 tok/s. For the current standings across every rig, including the RTX 5090, see the cheatsheet, /hw/rtx-5090, and /hw/m3-ultra.

Why small-MoE coders punch above their weight

Qwen3-Coder-30B-A3B keeps only about 3B of its 30B parameters active per token. On the same M3 Ultra it decodes at 112.2 tok/s versus 34.5 tok/s for the dense Qwen2.5-Coder-32B. That is roughly 3.25x faster for a comparable coding model, because decode speed tracks the memory traffic for the active parameters, not the total parameter count. That is the architectural win a mixture-of-experts buys you on memory-bandwidth-bound hardware.
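To make the arithmetic concrete, here is a minimal sketch that computes the measured speedup from the two figures above and compares it with an idealized bandwidth-bound ceiling. The ceiling model (every active weight read once per token) is a simplification, and the bytes-per-parameter and bandwidth values are hypothetical placeholders, not llm-speed measurements or the settings used in the runs.

```python
# Measured decode rates quoted in the text (tok/s, same M3 Ultra).
MOE_TOK_S = 112.2    # Qwen3-Coder-30B-A3B, about 3B active parameters
DENSE_TOK_S = 34.5   # Qwen2.5-Coder-32B, all 32B parameters active

# Hypothetical illustration values, NOT taken from the benchmark runs.
BYTES_PER_PARAM = 0.5    # assumes 4-bit weights
BANDWIDTH_GB_S = 800.0   # assumed memory bandwidth in GB/s


def speedup(fast_tok_s, slow_tok_s):
    """Ratio of two decode rates."""
    return fast_tok_s / slow_tok_s


def decode_ceiling_tok_s(active_params_billion, bytes_per_param, bandwidth_gb_s):
    """Idealized ceiling: each active weight is read from memory once per token."""
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


print(f"measured speedup: {speedup(MOE_TOK_S, DENSE_TOK_S):.2f}x")
print(f"ideal ceiling, 3B active:  {decode_ceiling_tok_s(3, BYTES_PER_PARAM, BANDWIDTH_GB_S):.0f} tok/s")
print(f"ideal ceiling, 32B active: {decode_ceiling_tok_s(32, BYTES_PER_PARAM, BANDWIDTH_GB_S):.0f} tok/s")
```

Under these assumptions the ceilings differ by the ratio of active parameters (about 10.7x), while the measured gap is about 3.25x. The simple model leaves out attention, KV-cache reads, expert routing, and other per-token overhead, so treat it as an upper bound rather than a prediction.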

What counts as fast enough

A rule of thumb: about 10 tok/s is slower than you can read, ~60 tok/s matches typical reading speed, and 200+ tok/s is faster than you can follow. For an interactive assistant, roughly 30 tok/s and up feels responsive. Definitions of decode vs prefill tok/s and TTFT are in the glossary.
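One way to turn these thresholds into wall-clock terms is to estimate how long a complete answer takes to arrive. The sketch below assumes the simple approximation total time = TTFT + tokens / decode rate. The function name, the 300-token answer length, and the 0.5 s TTFT are illustrative choices, not part of the llm-speed suite.

```python
def response_time_s(n_tokens, decode_tok_s, ttft_s=0.0):
    """Approximate seconds until the full answer has streamed.

    Assumes a constant decode rate after the first token arrives.
    """
    if decode_tok_s <= 0:
        raise ValueError("decode_tok_s must be positive")
    return ttft_s + n_tokens / decode_tok_s


ANSWER_TOKENS = 300  # hypothetical answer length
TTFT_S = 0.5         # hypothetical time to first token

for rate in (10, 30, 60, 200):
    total = response_time_s(ANSWER_TOKENS, rate, TTFT_S)
    print(f"{rate:>3} tok/s -> {total:.1f} s for {ANSWER_TOKENS} tokens")
```

For long answers the decode term dominates. For short ones TTFT, and therefore prefill speed, matters more.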

Reproduce it, or add your rig

Every number here is one command away from reproduction, and your run joins the leaderboard within seconds:

$ pipx install llm-speed && llm-speed bench

Numbers are as of July 2026; the linked run pages and the cheatsheet always reflect current data. Read how each figure is produced at /methodology.