Free tool

Will it run — and how fast?

Pick a model, quantization, and rig. We estimate whether it fits in your VRAM or unified memory (weights + KV cache + overhead) — and, unlike fit-only calculators, show the real measured decode tok/s from a signed run for that exact setup wherever we have one.

Model | Hardware | Quantization | Context length

Pick a model and a rig above, or browse the cheatsheet of every measured (model × hardware) cell or the tok/s predictor.

How the estimate works

Memory ≈ weights (params × effective quant bits ÷ 8) + KV cache (fp16, scaled by params and context) + a fixed runtime overhead. For mixture-of-experts models, the total parameter count drives memory because all experts must be resident, even though only a few billion parameters are active per token (which is why they decode fast). On Apple unified memory, we treat ~20% as reserved for the OS, so roughly 80% of the total counts as available to the model. The memory figures are estimates; the measured tok/s is the ground truth, and every number the site publishes traces to a signed run. Read the full methodology.
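As a rough illustration, the formula above can be sketched in a few lines of Python. This is not the site's actual implementation: the KV-cache coefficient, the overhead size, and all function and constant names are hypothetical placeholders chosen for the example. Only the overall shape (weights + KV cache + fixed overhead, total parameters for mixture-of-experts, ~20% OS reserve on unified memory) comes from the description.

```python
GIB = 1024 ** 3

# Hypothetical placeholders, NOT the site's real constants.
KV_BYTES_PER_TOKEN_PER_B_PARAMS = 16 * 1024  # assumed fp16 KV cost per token, per billion params
RUNTIME_OVERHEAD_BYTES = 1 * GIB             # assumed fixed runtime overhead
UNIFIED_OS_RESERVE = 0.20                    # ~20% of unified memory held back for the OS


def estimate_bytes(total_params_b: float, quant_bits: float, context_tokens: int) -> float:
    """Estimated memory in bytes: weights + KV cache + fixed overhead.

    For mixture-of-experts models, pass the TOTAL parameter count
    (in billions), since every expert must be resident in memory.
    """
    weights = total_params_b * 1e9 * quant_bits / 8
    kv_cache = KV_BYTES_PER_TOKEN_PER_B_PARAMS * total_params_b * context_tokens
    return weights + kv_cache + RUNTIME_OVERHEAD_BYTES


def usable_bytes(memory_gib: float, unified: bool) -> float:
    """Memory available to the model after any OS reserve."""
    reserve = UNIFIED_OS_RESERVE if unified else 0.0
    return memory_gib * GIB * (1.0 - reserve)


def fits(total_params_b: float, quant_bits: float, context_tokens: int,
         memory_gib: float, unified: bool = False) -> bool:
    """True if the estimate fits in the usable memory of the rig."""
    needed = estimate_bytes(total_params_b, quant_bits, context_tokens)
    return needed <= usable_bytes(memory_gib, unified)


if __name__ == "__main__":
    needed = estimate_bytes(total_params_b=8, quant_bits=4, context_tokens=8192)
    print(f"estimated: {needed / GIB:.2f} GiB")
    print("fits in 8 GiB unified memory:", fits(8, 4, 8192, memory_gib=8, unified=True))
    print("70B at 4-bit fits in 24 GiB VRAM:", fits(70, 4, 8192, memory_gib=24))
```

Under these assumed constants, an 8B model at 4 effective bits needs 4 GB for weights alone, while a 70B model at the same quantization needs 35 GB for weights and so cannot fit in a 24 GiB card regardless of the KV-cache term. A fit result only says the model loads; it says nothing about decode speed, which is why the measured tok/s is shown alongside it.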