llm-speed

How fast is Qwen3-Coder locally? RTX 5090, 4090, and Apple Silicon

Published 2026-07-02

Qwen3-Coder-30B-A3B is one of the strongest open coding models you can run locally: a 30B mixture-of-experts that keeps only about 3B parameters active per token. That design makes it fast on a surprising range of hardware. Here is what we measured: signed, reproducible runs, with each platform on the backend where it runs best. Every number links to its run.

Decode speed by hardware

RTX 5090: 260 tok/s. RTX 4090: 180 tok/s. Apple M4 Max: 113 tok/s. Apple M3 Ultra: 112 tok/s. The RTX 5090 leads at about 1.4x the speed of the 4090, and both Apple chips land near 112 tok/s. Full, current standings are on the cheatsheet and the model page.
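The ratios between these results follow directly from the four measurements. Here is a quick sketch of the arithmetic, using only the numbers in this post; the names `decode_tok_s` and `slowdown_vs_fastest` are ours, chosen for illustration.

```python
# Decode speeds reported in this post, in tokens per second.
decode_tok_s = {
    "RTX 5090": 260,
    "RTX 4090": 180,
    "Apple M4 Max": 113,
    "Apple M3 Ultra": 112,
}

fastest = max(decode_tok_s.values())


def slowdown_vs_fastest(tok_s: float) -> float:
    """Return how many times slower a result is than the fastest one."""
    return fastest / tok_s


for name, tok_s in decode_tok_s.items():
    ratio = slowdown_vs_fastest(tok_s)
    print(f"{name}: {tok_s} tok/s, leader is {ratio:.2f}x this speed")
```

The 4090 line is where the "about 1.4x" figure comes from: 260 divided by 180.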

Why it is quick even on a laptop-class chip

Decode speed tracks the memory traffic of the parameters that are active each token, not the model's total size. With only ~3B active, Qwen3-Coder moves far less data per token than a dense 30B, so even an Apple chip clears 100 tok/s (roughly 4x reading speed). The same effect is why small-MoE coders decode faster than dense 32B models on the same card.
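To make the memory-traffic argument concrete, here is a back-of-the-envelope sketch of a bandwidth ceiling: memory bandwidth divided by the bytes of weights read per token. The bytes-per-weight figure (about 4.5 bits at Q4) and the round 1000 GB/s bandwidth are illustrative assumptions, not measurements from our runs. The model also ignores KV-cache reads, attention, and kernel overhead, so measured decode speeds land well below this ceiling.

```python
def decode_ceiling_tok_s(active_params: float, bytes_per_param: float,
                         bandwidth_bytes_s: float) -> float:
    """Upper bound on decode speed if each active weight is read once per token."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_s / bytes_per_token


BYTES_PER_PARAM = 4.5 / 8   # assumption: roughly 4.5 bits per weight at Q4
BANDWIDTH = 1000e9          # assumption: a hypothetical card with 1000 GB/s

moe = decode_ceiling_tok_s(3e9, BYTES_PER_PARAM, BANDWIDTH)     # ~3B active
dense = decode_ceiling_tok_s(30e9, BYTES_PER_PARAM, BANDWIDTH)  # dense 30B

print(f"~3B active: {moe:.0f} tok/s ceiling")
print(f"dense 30B:  {dense:.0f} tok/s ceiling")
print(f"ratio:      {moe / dense:.0f}x")
```

Whatever bandwidth you plug in, the ratio between the two ceilings depends only on the active parameter counts, which is the point of the paragraph above.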

What should you run it on?

It fits comfortably at Q4 on any 24 GB or larger GPU, and on a high-memory Apple M-series machine. For maximum speed pick the RTX 5090; the RTX 4090 is about 30% slower but cheaper; an Apple M-series runs it quietly at about 112 tok/s. Confirm the fit for your card with the VRAM-fit checker.
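For a rough sense of why 24 GB is enough, here is a minimal sketch of the weight-footprint arithmetic. It is not how the VRAM-fit checker works. The 4.5 bits per weight and the 3 GB allowance for KV cache and runtime buffers are placeholder assumptions, and long contexts need more. Note that all 30B parameters must be resident in memory, even though only about 3B are active per token.

```python
def weight_footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def fits(vram_gb: float, total_params: float = 30e9,
         bits_per_weight: float = 4.5, overhead_gb: float = 3.0) -> bool:
    """Rough check: weights plus an assumed overhead must fit in memory."""
    needed = weight_footprint_gb(total_params, bits_per_weight) + overhead_gb
    return needed <= vram_gb


print(f"weights at ~Q4: {weight_footprint_gb(30e9, 4.5):.1f} GB")
for vram in (16, 24, 32):
    print(f"{vram} GB card: {'fits' if fits(vram) else 'does not fit'}")
```

Treat the result as a first guess only, and use the checker for a real answer on your card.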

Is that fast enough for a coding agent?

Comfortably. Anything above roughly 40 tok/s already outpaces reading speed, so even the Apple result near 112 tok/s streams edits faster than you can follow, and the 4090 at 180 tok/s keeps up with long agent traces that emit thousands of tokens per turn. For a model this capable, the bottleneck is rarely the decode rate. The tok/s thresholds are in the glossary.
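To put "thousands of tokens per turn" in wall-clock terms, here is a small sketch that converts decode speed into seconds per agent turn. The 2000-token turn length is an assumed example, and the calculation covers decode only, so prompt processing time is not included.

```python
def seconds_to_emit(tokens: int, tok_s: float) -> float:
    """Time to generate `tokens` at a steady decode rate of `tok_s`."""
    return tokens / tok_s


TURN_TOKENS = 2000  # assumption: one long agent turn

# Decode speeds from this post, plus the ~40 tok/s reading-speed threshold.
rates = [
    ("RTX 5090", 260),
    ("RTX 4090", 180),
    ("Apple M3 Ultra", 112),
    ("40 tok/s threshold", 40),
]

for name, tok_s in rates:
    print(f"{name}: {seconds_to_emit(TURN_TOKENS, tok_s):.1f} s per turn")
```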

Reproduce it, or add your rig

The same one-line install measures any of these:

$ pipx install llm-speed && llm-speed bench

Numbers as of July 2026; the linked runs and the cheatsheet always reflect current data.