
Cheapest rig that runs a 70B model comfortably

Llama-3.3-70B and Qwen2.5-72B at 4-bit need ~40 GB of memory. Here's the minimum-spec hardware that holds the model and still sustains usable decode tok/s.

No data submitted for this task yet.

Run the suite and submit the first benchmark for this guide:

$ pipx install llm-speed && llm-speed bench

A 70B-class dense model at 4-bit (Q4_K_M / Q4_0 / 4-bit MLX) lives in roughly 40 GB of memory once you add KV cache for a useful context window. That rules out 24 GB single-card consumer GPUs unless you're willing to split layers across devices, and rules in: a pair of 3090s or 4090s (48 GB combined VRAM), an Apple Silicon Max or Ultra with 64 GB+ unified memory, an AMD MI300X or an NVIDIA H100 80 GB on the high end, or AMD's Ryzen AI MAX+ 395 with 96 GB of unified memory. The buy-once cost ranges from ~$1,500 for a used dual-3090 build to several thousand for an Ultra Mac Studio.

Below is every submitted run for a 70B-or-larger dense or MoE model, ranked by decode tok/s. If your candidate isn't there yet, run the suite and submit; that row will appear on the next refresh.
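To see where the ~40 GB figure comes from, here is a minimal back-of-the-envelope sketch in Python (the helper names are ours, not part of llm-speed). It assumes Q4_K_M averages roughly 4.5 bits per weight, since the mix keeps some tensors at higher precision, and plugs in Llama-3.3-70B's published shape: 80 layers, 8 KV heads under grouped-query attention, head dimension 128, with the KV cache held at fp16.

# Back-of-the-envelope memory estimate for a dense 70B model at 4-bit.
# All constants are illustrative assumptions, not values measured by the suite.

def weight_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    # Q4_K_M averages a bit over 4 bits/weight because some tensors
    # stay at higher precision; ~4.5 is a rough middle figure.
    return params_billion * bits_per_weight / 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx_tokens: int, bytes_per_elem: int = 2) -> float:
    # K and V each store layers * kv_heads * head_dim values per token,
    # here at fp16 (2 bytes per element).
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_tokens / 1e9

weights = weight_gb(70)                          # ~39.4 GB
kv = kv_cache_gb(80, 8, 128, ctx_tokens=8192)    # ~2.7 GB at 8k context
print(f"weights ~{weights:.1f} GB + KV ~{kv:.1f} GB = ~{weights + kv:.1f} GB")

Under these assumptions an 8k context lands around 42 GB, consistent with the ~40 GB ballpark above; either way, 48 GB of combined VRAM is the practical floor and a single 24 GB card falls well short. Grouped-query attention keeps cache growth gentle: even 32k of context adds only about 11 GB on this shape.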

Side-by-side comparisons

See also: All hardware · All models · Methodology