Skip to content
llm-speed

The best model for a 512GB Mac Studio (measured)

Published 2026-07-03

A maxed M3 Ultra Mac Studio has up to 512 GB of unified memory, so it can hold almost any open model. The VRAM ceiling that limits consumer GPUs simply is not there. That flips the question: not what fits, but what runs fast enough. We measured it, signed and reproducible.

A large mixture-of-experts is the sweet spot

The standout is Qwen3-Next-80B-A3B: an 80B model that activates only about 3B parameters per token, so it decodes 80 tok/s on the M3 Ultra. That is nearly double a dense 22B (Codestral) at 47 tok/s, and roughly five times a dense 70B like Llama-3.3-70B at 17 tok/s. Total parameter count does not set decode speed; active parameters do, so a big MoE gives you frontier-class capability at conversational speed.

The practical ladder

For coding, the fast MoE coders lead: DeepSeek-Coder-V2-Lite at 168 tok/s and Qwen3-Coder-30B-A3B at 112 tok/s. For a big, capable general model, Qwen3-Next-80B at 80 tok/s is the pick. Dense 70B-class models (Llama-3.3-70B, Qwen2.5-72B at 16 tok/s) fit easily but stream slowly, so run one only when you specifically need that model.

Which should you run?

On a big-memory Mac, default to a large MoE: Qwen3-Next-80B for a general daily driver, or a Qwen3-Coder / DeepSeek-Coder MoE for coding, all at 80 tok/s or better. Reach for a dense 70B only for a specific capability, and accept the ~17 tok/s stream. See the full Apple ranking on the M3 Ultra page and how Apple compares to a top GPU in RTX 5090 vs M3 Ultra.

Reproduce it, or add your rig

The same one-line install measures any model on your Mac:

$ pipx install llm-speed && llm-speed bench

Numbers as of July 2026; the linked runs and the cheatsheet always reflect current data.