RTX 5090 vs M3 Ultra for local LLMs: speed vs memory

Published 2026-07-02

The two main ways to run large models locally at speed are a flagship NVIDIA GPU and a maxed-out Apple Silicon Mac. We measured both, each on the backend it runs best (llama.cpp on the RTX 5090, MLX on the M3 Ultra). Every run is signed and reproducible, and every number links to its run.

Head to head, same models

Decode speed in tok/s, fastest run on each platform, RTX 5090 vs M3 Ultra. stable-code-3b: 356 vs 192. DeepSeek-Coder-V2-Lite: 268 vs 168. gpt-oss-20b: 302 vs 152. The dense Qwen2.5-Coder-32B: 71 vs 34. The RTX 5090 decodes roughly 1.6x to 2.1x faster across the board. Full standings are on the cheatsheet.
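
For readers who want to check the ratios themselves, here is a small sketch that recomputes the per-model speedup from the decode figures quoted above. The dictionary layout and the function name are illustrative choices, not part of any tool mentioned in this post.

```python
# Decode tok/s copied from the head-to-head figures above,
# stored as (RTX 5090, M3 Ultra).
RESULTS = {
    "stable-code-3b": (356, 192),
    "DeepSeek-Coder-V2-Lite": (268, 168),
    "gpt-oss-20b": (302, 152),
    "Qwen2.5-Coder-32B": (71, 34),
}


def speedups(results):
    """Return {model: RTX 5090 tok/s divided by M3 Ultra tok/s}."""
    return {model: gpu / mac for model, (gpu, mac) in results.items()}


if __name__ == "__main__":
    for model, ratio in speedups(RESULTS).items():
        print(f"{model}: {ratio:.2f}x")
```

The smallest gap is on DeepSeek-Coder-V2-Lite and the largest on the dense Qwen2.5-Coder-32B.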

Speed vs memory, the real trade

Decoding is memory-bandwidth-bound: each generated token requires streaming the model's active weights from memory. The 5090's ~1.8 TB/s beats the M3 Ultra's ~0.8 TB/s, so the GPU wins on raw speed. But the 5090 caps at 32 GB of VRAM, while an M3 Ultra offers up to 512 GB of unified memory. That lets the Mac hold models no single consumer GPU can fit (very large mixture-of-experts, or 70B-plus at higher precision), just more slowly. So the real choice is speed versus capacity, not a single winner. The GPU speed explainer breaks down why bandwidth drives decode.
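
The bandwidth argument can be turned into a back-of-envelope ceiling: if every weight has to be read once per generated token, decode speed cannot exceed bandwidth divided by the size of the weights. The sketch below applies that to a dense 32B model. The 4.5 bits per weight is an assumed, illustrative quantization level, not the setting used in the linked runs, and the function name is hypothetical. The estimate ignores KV-cache reads and compute, and it does not apply directly to mixture-of-experts models, which read only a subset of their weights per token.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, params_billions, bits_per_weight):
    """Upper bound on decode tok/s if all weights are read once per token."""
    # Billions of parameters * bits per weight / 8 = gigabytes of weights.
    weight_gb = params_billions * bits_per_weight / 8
    return bandwidth_gb_s / weight_gb


# Approximate bandwidths quoted in the text, in GB/s.
RTX_5090_GB_S = 1800
M3_ULTRA_GB_S = 800

# Assumption for illustration only: dense 32B model at ~4.5 bits per weight.
PARAMS_B = 32
BITS = 4.5

if __name__ == "__main__":
    for name, bw in (("RTX 5090", RTX_5090_GB_S), ("M3 Ultra", M3_ULTRA_GB_S)):
        ceiling = decode_ceiling_tok_s(bw, PARAMS_B, BITS)
        print(f"{name}: at most ~{ceiling:.0f} tok/s")
```

Under these assumptions the ceilings come out near 100 and 44 tok/s, above the measured 71 and 34 tok/s for Qwen2.5-Coder-32B, which is what a bandwidth-bound workload with some overhead would look like. The ratio between the two ceilings is simply the bandwidth ratio, about 2.25x, whatever model size is plugged in.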

Which should you pick?

If your models fit in 32 GB and you want maximum speed, pick the RTX 5090. If you need to run very large models locally, or want a quiet, power-efficient all-in-one, pick the M3 Ultra. Check whether a given model fits either machine with the VRAM-fit checker, and compare any other pair on the head-to-head pages.
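
As a rough first pass before reaching for the checker, the fit question is mostly arithmetic: weight size plus some headroom against available memory. The sketch below is not the VRAM-fit checker itself. Its function names and the flat 4 GB allowance for KV cache and runtime buffers are assumptions chosen for illustration, and on a Mac not all unified memory is necessarily available to the GPU, so treat the 512 GB figure as an upper limit.

```python
def weights_gb(params_billions, bits_per_weight):
    """Approximate weight size in GB for a given precision."""
    return params_billions * bits_per_weight / 8


def fits(params_billions, bits_per_weight, memory_gb, overhead_gb=4.0):
    """Rough check: weights plus a flat allowance must fit in memory.

    overhead_gb is an assumed placeholder for KV cache and runtime buffers;
    real usage grows with context length and batch size.
    """
    return weights_gb(params_billions, bits_per_weight) + overhead_gb <= memory_gb


# Memory capacities quoted in the text, in GB.
RTX_5090_GB = 32
M3_ULTRA_MAX_GB = 512

if __name__ == "__main__":
    # Illustrative cases: a 32B model at ~4.5 bits and a 70B model at 8 bits.
    for params, bits in ((32, 4.5), (70, 8)):
        size = weights_gb(params, bits)
        print(
            f"{params}B at {bits} bits (~{size:.0f} GB weights): "
            f"5090={fits(params, bits, RTX_5090_GB)}, "
            f"M3 Ultra={fits(params, bits, M3_ULTRA_MAX_GB)}"
        )
```

With these inputs the 32B model at about 4.5 bits fits on both machines, while a 70B model at 8 bits fits only in the Mac's unified memory, which matches the speed-versus-capacity split described above.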

Reproduce it, or add your rig

The same one-line install measures either platform:

$ pipx install llm-speed && llm-speed bench

Numbers as of July 2026; the linked runs and the cheatsheet always reflect current data.