llm-speed

Hardware FAQ

x — frequently asked questions

Direct answers to the questions local-LLM enthusiasts ask about x, drawn from 7 signed runs on llm-speed. Every numerical claim links to a verifiable run permalink at /r/<id>.

What's the fastest LLM on x?

The fastest measured LLM on x on llm-speed is m at 10.0 decode tok/s on llama.cpp (workload chat-short, run r_0_i4fok_cfg). Cite as https://llm-speed.com/r/r_0_i4fok_cfg.

This is the headline decode tokens-per-second across every (model, backend) pairing submitted on x; faster results may exist in configurations (model, backend, or quant) not yet benchmarked on x, but among signed runs this is the published top.

Run r_0_i4fok_cfg · x leaderboard

Can I run a 7B-class model on x?

No 7B-class model run on x has been submitted to llm-speed yet, so the canonical "yes/no with a measured tok/s" answer is not currently published.

x should fit a 7B model at 4-bit quantization; submit a run with "llm-speed bench --models <hf-id>" to populate this answer.

x leaderboard

Can I run a 30B-class model on x?

No 30B-class model run on x has been submitted to llm-speed yet, so the canonical "yes/no with a measured tok/s" answer is not currently published.

x should fit a 30B model at 4-bit quantization; submit a run with "llm-speed bench --models <hf-id>" to populate this answer.

x leaderboard

Can I run a 70B-class model on x?

No 70B-class model run on x has been submitted to llm-speed yet, so the canonical "yes/no with a measured tok/s" answer is not currently published.

x should fit a 70B model at 4-bit quantization; submit a run with "llm-speed bench --models <hf-id>" to populate this answer.

x leaderboard
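The "should fit at 4-bit" claims above follow from simple weight-size arithmetic. A minimal sketch, with the caveat that the ~4.5 bits/weight figure (typical of Q4-style quants) and the 20% runtime overhead factor are rough assumptions, not llm-speed measurements:

```python
# Rough memory estimate for running a model at a given quantization.
# Assumptions: weights cost ~bits_per_weight/8 bytes each, plus ~20%
# overhead for KV cache, activations, and runtime buffers.

def estimated_gib(params_billion: float, bits_per_weight: float = 4.5,
                  overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 2**30

for size in (7, 30, 70):
    print(f"{size}B @ ~4.5 bits/weight: ~{estimated_gib(size):.1f} GiB")
```

Under these assumptions a 7B model lands around 4-5 GiB and a 70B model around 44 GiB, which is why the 70B answer usually comes down to total RAM/VRAM rather than compute.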

How does x compare to RTX 5090 for local LLM inference?

For a head-to-head between x and RTX 5090, see the side-by-side comparison page at https://llm-speed.com/vs/x-vs-rtx-5090, which lays out every (model, backend) pair where both rigs have a signed run.

As a single-rig anchor, x tops out at 10.0 decode tok/s on m via llama.cpp (run r_0_i4fok_cfg); the RTX 5090 top number is on its own /hw/<slug> page, so the comparison stays grounded in measured numbers rather than extrapolation.

x vs RTX 5090 (side-by-side) · RTX 5090 leaderboard · x leaderboard

Which backend is fastest on x?

The only backend with a signed local run on x so far is llama.cpp, with a top result of 10.0 decode tok/s on m (run r_0_i4fok_cfg); a multi-backend comparison on x is not yet published.

Submit a competing run with "llm-speed bench --backends <other-backend>" on an x machine to populate the comparison.

Run r_0_i4fok_cfg · x leaderboard

Is x worth it for local LLM inference in 2026?

On signed data, x delivers up to 10.0 decode tok/s on m via llama.cpp (run r_0_i4fok_cfg), which puts it below conversational-reading pace (~20 tok/s) for the published top configuration.
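To put the decode rate in wall-clock terms, a quick sketch; the 10.0 tok/s figure is the measured top run above, while the ~20 tok/s reading pace and the response lengths are illustrative assumptions:

```python
# Wall-clock decode time for a response of a given length.
# 10.0 tok/s is the measured top run on x; 20.0 tok/s (reading pace)
# and the token counts below are rough assumptions for illustration.

def decode_seconds(tokens: int, tok_per_s: float) -> float:
    return tokens / tok_per_s

for tokens in (100, 500, 1500):
    print(f"{tokens:>5} tokens: {decode_seconds(tokens, 10.0):5.0f} s at 10 tok/s, "
          f"{decode_seconds(tokens, 20.0):5.0f} s at ~20 tok/s reading pace")
```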

"Worth it" depends on your model class: x is most useful for 7B-class models and smaller; larger models will need quantization or a bigger rig.

For "what fits and how fast", the per-model rows on /hw/x are the honest answer; for cross-rig comparisons, see /vs/x-vs-rtx-5090.

x leaderboard · x vs RTX 5090

What quantization should I use on x?

On x, the only quant with a signed local run is Q4 at 10.0 tok/s on m (run r_0_i4fok_cfg); a multi-quant comparison on x is not yet published.

Quant choice is a quality-vs-speed tradeoff that this hardware FAQ does not arbitrate; llm-speed publishes hardware-side speed, not output quality. For quality scores, see the model card on Hugging Face and the LMSYS Chatbot Arena.

Q4: 10.0 tok/s (run r_0_i4fok_cfg)
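One reason quant choice affects speed and not just quality: single-stream decode is typically memory-bandwidth bound, so tok/s scales roughly with bytes read per token. A back-of-envelope sketch, assuming the bandwidth figure and bits-per-weight values below, which are illustrative and not llm-speed measurements:

```python
# Bandwidth-bound decode estimate: each generated token reads roughly
# all weights once, so tok/s ~= memory_bandwidth / weight_bytes.
# The 100 GB/s bandwidth, 7B size, and bits/weight values are assumptions.

def est_tok_per_s(params_billion: float, bits_per_weight: float,
                  bandwidth_gb_s: float) -> float:
    weight_gb = params_billion * bits_per_weight / 8  # GB of weights read/token
    return bandwidth_gb_s / weight_gb

for label, bits in (("Q4", 4.5), ("Q8", 8.5), ("FP16", 16.0)):
    print(f"{label:>4}: ~{est_tok_per_s(7, bits, 100.0):.1f} tok/s upper bound")
```

This is an upper bound, not a prediction; real runs fall short of it, which is exactly why the signed per-quant rows on /hw/x are the numbers to trust.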