llm-speed

How fast is Qwen3.6-27B on an RTX 4090?

Published 2026-07-03

Qwen3.6-27B is a standout 2026 release: a dense 27B model that is multimodal, has a long context window, and posts top-tier coding scores while fitting on a single 24 GB card at Q4. That makes it a natural pick for a single mainstream GPU, so the question is how fast it actually decodes. We measured it with signed, reproducible runs, and every number links to its run.

Decode speed: 4090 and 5090

On a 24 GB RTX 4090, Qwen3.6-27B decodes at 44 tok/s at Q4, with VRAM left over for context. On an RTX 5090 it reaches 74 tok/s. Because it is a dense model, the rate is moderate rather than blazing, but 44 tok/s is comfortably above reading speed and usable for a coding agent on a single 4090.
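To put those rates in wall-clock terms, here is a minimal sketch that converts tok/s into decode time for one response. The 1,000-token response length is a hypothetical figure chosen for illustration, and the calculation ignores prompt processing time.

```python
def seconds_to_generate(num_tokens: int, tok_per_s: float) -> float:
    """Decode-only time for num_tokens at a steady rate (prompt processing excluded)."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return num_tokens / tok_per_s


# Decode rates for Qwen3.6-27B quoted in this article.
RATES_TOK_S = {"RTX 4090": 44.0, "RTX 5090": 74.0}

# Hypothetical response length, chosen only for illustration.
RESPONSE_TOKENS = 1000

for gpu, rate in RATES_TOK_S.items():
    print(f"{gpu}: {seconds_to_generate(RESPONSE_TOKENS, rate):.1f} s")
```

At these rates a 1,000-token answer takes roughly 23 seconds on the 4090 and roughly 14 seconds on the 5090.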

The faster MoE sibling

Qwen3.6 also ships a 35B-A3B mixture-of-experts model that activates only about 3B parameters per token. On an RTX 5090 it decodes at 224 tok/s, roughly three times the dense 27B's 74 tok/s on the same card, because decode speed tracks active-parameter memory traffic, not total size. The same effect shows up across the fastest local coders.
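The memory-traffic argument can be sketched with a simple bandwidth-bound ceiling: if every active weight is read once per token, decode speed cannot exceed memory bandwidth divided by the bytes read per token. The values below are assumptions, not measurements from our runs: 4.5 bits per weight as a stand-in for a typical Q4 quantization, and 1792 GB/s as the nominal RTX 5090 memory bandwidth. The function name is ours and is not part of llm-speed.

```python
def decode_ceiling_tok_s(active_params_b: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode tok/s if each active weight is read once per token.

    active_params_b is in billions of parameters, so the product below is in GB.
    Ignores KV-cache reads, activations, and all compute and launch overheads.
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


BITS_PER_WEIGHT = 4.5      # assumption: rough average for a Q4 quantization
RTX_5090_GB_S = 1792.0     # assumption: nominal spec-sheet bandwidth

dense_ceiling = decode_ceiling_tok_s(27.0, BITS_PER_WEIGHT, RTX_5090_GB_S)
moe_ceiling = decode_ceiling_tok_s(3.0, BITS_PER_WEIGHT, RTX_5090_GB_S)

# Measured rates from this article, both on an RTX 5090.
measured_ratio = 224 / 74

print(f"dense 27B ceiling:    {dense_ceiling:.0f} tok/s")
print(f"35B-A3B ceiling:      {moe_ceiling:.0f} tok/s")
print(f"ceiling ratio:        {moe_ceiling / dense_ceiling:.1f}x")
print(f"measured ratio:       {measured_ratio:.1f}x")
```

Under these assumptions the ceiling ratio is simply 27/3 = 9x, while the measured ratio is about 3x. The simple model gets the direction right but overstates the size of the gain, which suggests that costs it leaves out take up a larger share of each token once the weight reads become small.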

Which should you run?

Pick the dense 27B when you want its specific coding quality on a single 24 GB card and 44 to 74 tok/s is enough. Pick the 35B-A3B MoE when you want raw speed. Check the fit for your card with the VRAM-fit checker, and compare against other coders in Codestral vs Qwen3-Coder.
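As a rough first pass before using the checker, weight footprint can be estimated from total parameters and bits per weight. This is a back-of-the-envelope sketch, not the VRAM-fit checker's logic: the 4.5 bits per weight figure is an assumed average for Q4, and whatever remains has to cover the KV cache, activations, and runtime overhead. Note that fit depends on total parameters (all 35B for the MoE), even though speed depends on active ones.

```python
def weight_footprint_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB; total_params_b is in billions of parameters."""
    return total_params_b * bits_per_weight / 8


def headroom_gb(total_params_b: float, bits_per_weight: float, vram_gb: float) -> float:
    """VRAM left after weights, for KV cache and overhead. Negative means no fit."""
    return vram_gb - weight_footprint_gb(total_params_b, bits_per_weight)


BITS_PER_WEIGHT = 4.5  # assumption: rough average for a Q4 quantization

# (label, total parameters in billions); MoE weights must all be resident.
MODELS = [("Qwen3.6-27B dense", 27.0), ("Qwen3.6 35B-A3B MoE", 35.0)]
CARDS_GB = {"RTX 4090": 24.0, "RTX 5090": 32.0}

for name, params_b in MODELS:
    size = weight_footprint_gb(params_b, BITS_PER_WEIGHT)
    for card, vram in CARDS_GB.items():
        left = headroom_gb(params_b, BITS_PER_WEIGHT, vram)
        print(f"{name}: ~{size:.1f} GB weights, ~{left:.1f} GB headroom on {card}")
```

Treat the headroom figure as optimistic: actual quantization formats vary in bits per weight, and long contexts can consume several GB of KV cache on their own.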

Reproduce it, or add your rig

The same one-line install measures either variant:

$ pipx install llm-speed && llm-speed bench

Numbers as of July 2026; the linked runs and the cheatsheet always reflect current data.