llm-speed

The fastest local coding models in 2026 (measured on an RTX 5090)

Published 2026-07-02

A local coding agent lives or dies on decode speed: the rate, in tokens per second (tok/s), at which it streams a fix back. So we measured it with signed, reproducible runs on a single RTX 5090 (32 GB, llama.cpp). Every number below links to the exact submission.

Fastest coders on the RTX 5090

Ranked by measured decode speed: stable-code-3b at 356 tok/s, DeepSeek-Coder-V2-Lite at 268 tok/s, Qwen2.5-Coder-7B at 240 tok/s, and the dense Qwen2.5-Coder-32B at 71 tok/s. The full ranked table across every model and rig is on the cheatsheet and at /hw/rtx-5090.
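To make those rates concrete, here is a minimal sketch that converts each measured tok/s figure into the time needed to stream one reply. The 800-token reply length is an assumption chosen for illustration, not something taken from the linked runs, and the calculation covers decode only (no prompt processing).

```python
# Measured decode speeds quoted above (tok/s, single RTX 5090, llama.cpp).
DECODE_TOK_S = {
    "stable-code-3b": 356,
    "DeepSeek-Coder-V2-Lite": 268,
    "Qwen2.5-Coder-7B": 240,
    "Qwen2.5-Coder-32B": 71,
}

# Assumption: one agent reply (a patch plus a short explanation) is ~800 tokens.
PATCH_TOKENS = 800


def seconds_to_stream(tokens: int, tok_per_s: float) -> float:
    """Decode-only streaming time; ignores prompt processing (prefill)."""
    return tokens / tok_per_s


if __name__ == "__main__":
    ranked = sorted(DECODE_TOK_S.items(), key=lambda kv: kv[1], reverse=True)
    for name, speed in ranked:
        secs = seconds_to_stream(PATCH_TOKENS, speed)
        print(f"{name:<24} {speed:>4} tok/s  {secs:5.1f} s")
```

Under that assumed reply length, the three fastest models finish in roughly 2 to 3 seconds, while the dense 32B takes about 11 seconds.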

Small-MoE coders punch far above their weight

DeepSeek-Coder-V2-Lite is a 16B mixture-of-experts model with only ~2.4B parameters active per token, and it decodes at 268 tok/s, about 3.8x the 71 tok/s of the dense Qwen2.5-Coder-32B on the same card. Decode is largely memory-bandwidth-bound: each generated token requires reading the active weights from VRAM, so speed tracks active-parameter memory traffic, not total model size. For a daily-driver coding agent, a small-MoE coder is usually the right speed/quality trade.
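The bandwidth argument can be sketched as a back-of-the-envelope ceiling: if every token has to read each active weight once, tok/s cannot exceed memory bandwidth divided by the bytes of active weights. The bandwidth figure (the RTX 5090 spec-sheet value) and the quantization width below are assumptions for illustration; the linked runs may use different settings.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_b: float,
                         bits_per_weight: float) -> float:
    """Upper bound on decode tok/s if each token reads every active weight once.

    Ignores KV-cache reads, attention compute, and kernel launch overhead,
    so measured numbers land below this ceiling.
    """
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumptions, not taken from the linked runs:
BANDWIDTH_GB_S = 1792.0  # RTX 5090 spec-sheet memory bandwidth
BITS_PER_WEIGHT = 4.5    # hypothetical, roughly a 4-bit quantization

moe_ceiling = decode_ceiling_tok_s(BANDWIDTH_GB_S, 2.4, BITS_PER_WEIGHT)
dense_ceiling = decode_ceiling_tok_s(BANDWIDTH_GB_S, 32.0, BITS_PER_WEIGHT)

ceiling_ratio = moe_ceiling / dense_ceiling  # depends only on active params
measured_ratio = 268 / 71                    # from the measured runs above

if __name__ == "__main__":
    print(f"ceiling ratio  {ceiling_ratio:.1f}x")
    print(f"measured ratio {measured_ratio:.1f}x")
```

The ceiling ratio comes out near 13x, while the measured gap is about 3.8x. That is consistent with the small-MoE model hitting per-token overheads that this simple model leaves out, so treat the estimate as a direction, not a prediction.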

The pattern holds on Apple Silicon

The same holds off NVIDIA: Qwen3-Coder-30B-A3B (30B total, ~3B active) decodes at 112 tok/s on an M3 Ultra. Compare any GPU against any Mac on the head-to-head pages.

How to pick

Want the fastest usable coder that fits a single 24-32 GB card? A small-MoE coder (DeepSeek-Coder-V2-Lite, Qwen3-Coder-30B-A3B) beats a dense 32B on tok/s while staying competitive on code quality. To reproduce any of these runs, or to add your own rig:

$ pipx install llm-speed && llm-speed bench
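The "fits a single 24-32 GB card" half of the picking rule can be sketched as a weights-only footprint check. Note that an MoE model must hold all of its parameters in memory even though only a few are active per token, so fit depends on total size while speed depends on active size. The quantization width and the headroom reserved for KV cache and runtime buffers are assumptions here, and `fits` is a hypothetical helper, not part of llm-speed.

```python
def weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes); excludes KV cache."""
    return total_params_b * bits_per_weight / 8


def fits(total_params_b: float, bits_per_weight: float,
         vram_gb: float, headroom_gb: float = 4.0) -> bool:
    """True if weights plus an assumed headroom fit in the given VRAM."""
    return weights_gb(total_params_b, bits_per_weight) + headroom_gb <= vram_gb


# Total parameter counts in billions, as named in the article.
CANDIDATES = {
    "DeepSeek-Coder-V2-Lite": 16,
    "Qwen3-Coder-30B-A3B": 30,
    "Qwen2.5-Coder-32B": 32,
}

BITS_PER_WEIGHT = 4.5  # assumption: roughly a 4-bit quantization

if __name__ == "__main__":
    for name, total_b in CANDIDATES.items():
        size = weights_gb(total_b, BITS_PER_WEIGHT)
        print(f"{name:<24} {size:5.1f} GB  "
              f"24 GB: {fits(total_b, BITS_PER_WEIGHT, 24)}  "
              f"32 GB: {fits(total_b, BITS_PER_WEIGHT, 32)}")
```

Raising `bits_per_weight` or `headroom_gb` (for example, for long contexts) shrinks what fits, so it is worth rerunning the check with your own settings.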

Numbers as of July 2026; the linked runs and the cheatsheet always reflect current data.