llm-speed

Hardware FAQ

x — frequently asked questions

Direct answers to the questions local-LLM enthusiasts ask about x, drawn from 7 signed runs on llm-speed. Every numerical claim links to a verifiable run permalink at /r/<id>.

What's the fastest LLM on x?

The fastest measured LLM on x on llm-speed is m at 10.0 decode tok/s on llama.cpp (workload chat-short, run r_0_i4fok_cfg). Cite as https://llm-speed.com/r/r_0_i4fok_cfg.

This is the headline decode tokens-per-second across every (model, backend) pairing submitted on x; faster results may exist in configurations (model, backend, or quant) not yet benchmarked on x, but among signed runs this is the published top.

Run r_0_i4fok_cfg · x leaderboard

Can I run a 7B-class model on x?

No 7B-class model run on x has been submitted to llm-speed yet, so the canonical "yes/no with a measured tok/s" answer is not currently published.

x should fit a 7B model at 4-bit quantization; submit a run with "llm-speed bench --models <hf-id>" to populate this answer.

x leaderboard

Can I run a 30B-class model on x?

No 30B-class model run on x has been submitted to llm-speed yet, so the canonical "yes/no with a measured tok/s" answer is not currently published.

x should fit a 30B model at 4-bit quantization; submit a run with "llm-speed bench --models <hf-id>" to populate this answer.

x leaderboard

Can I run a 70B-class model on x?

No 70B-class model run on x has been submitted to llm-speed yet, so the canonical "yes/no with a measured tok/s" answer is not currently published.

x should fit a 70B model at 4-bit quantization; submit a run with "llm-speed bench --models <hf-id>" to populate this answer.

x leaderboard
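The "should fit at 4-bit" claims above follow from simple weight-size arithmetic. A minimal sketch, with the caveat that the ~4.5 bits/weight figure (typical of Q4-style quants) and the 20% runtime overhead factor are rough assumptions, not llm-speed measurements:

```python
# Rough memory estimate for running a model at a given quantization.
# Assumptions: weights cost ~bits_per_weight/8 bytes each, plus ~20%
# overhead for KV cache, activations, and runtime buffers.

def estimated_gib(params_billion: float, bits_per_weight: float = 4.5,
                  overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 2**30

for size in (7, 30, 70):
    print(f"{size}B @ ~4.5 bits/weight: ~{estimated_gib(size):.1f} GiB")
```

Under these assumptions a 7B model lands around 4-5 GiB and a 70B model around 44 GiB, which is why the 70B answer usually comes down to total RAM/VRAM rather than compute.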

How does x compare to RTX 5090 for local LLM inference?

For a head-to-head between x and RTX 5090, see the side-by-side comparison page at https://llm-speed.com/vs/x-vs-rtx-5090, which lays out every (model, backend) pair where both rigs have a signed run.

As a single-rig anchor, x tops out at 10.0 decode tok/s on m via llama.cpp (run r_0_i4fok_cfg); the RTX 5090 top number is on its own /hw/<slug> page, so the comparison stays grounded in measured numbers rather than extrapolation.

x vs RTX 5090 (side-by-side) · RTX 5090 leaderboard · x leaderboard

Which backend is fastest on x?

The only backend with a signed local run on x so far is llama.cpp, with a top result of 10.0 decode tok/s on m (run r_0_i4fok_cfg); a multi-backend comparison on x is not yet published.

Submit a competing run with "llm-speed bench --backends <other-backend>" on an x machine to populate the comparison.

Run r_0_i4fok_cfg · x leaderboard

Is x worth it for local LLM inference in 2026?

On signed data, x delivers up to 10.0 decode tok/s on m via llama.cpp (run r_0_i4fok_cfg), which puts it below conversational-reading pace (~20 tok/s) for the published top configuration.
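To put the decode rate in wall-clock terms, a quick sketch; the 10.0 tok/s figure is the measured top run above, while the ~20 tok/s reading pace and the response lengths are illustrative assumptions:

```python
# Wall-clock decode time for a response of a given length.
# 10.0 tok/s is the measured top run on x; 20.0 tok/s (reading pace)
# and the token counts below are rough assumptions for illustration.

def decode_seconds(tokens: int, tok_per_s: float) -> float:
    return tokens / tok_per_s

for tokens in (100, 500, 1500):
    print(f"{tokens:>5} tokens: {decode_seconds(tokens, 10.0):5.0f} s at 10 tok/s, "
          f"{decode_seconds(tokens, 20.0):5.0f} s at ~20 tok/s reading pace")
```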

"Worth it" depends on your model class: x is most useful for 7B-class models and smaller; larger models will need quantization or a bigger rig.

For "what fits and how fast", the per-model rows on /hw/x are the honest answer; for cross-rig comparisons, see /vs/x-vs-rtx-5090.

x leaderboard · x vs RTX 5090

What quantization should I use on x?

On x, the only quant with a signed local run is Q4 at 10.0 tok/s on m (run r_0_i4fok_cfg); a multi-quant comparison on x is not yet published.

Quant choice is a quality-vs-speed tradeoff that this hardware FAQ does not arbitrate; llm-speed publishes hardware-side speed, not output quality. For quality scores, see the model card on Hugging Face and the LMSYS Chatbot Arena.

Q4: 10.0 tok/s (run r_0_i4fok_cfg)
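One reason quant choice affects speed and not just quality: single-stream decode is typically memory-bandwidth bound, so tok/s scales roughly with bytes read per token. A back-of-envelope sketch, assuming the bandwidth figure and bits-per-weight values below, which are illustrative and not llm-speed measurements:

```python
# Bandwidth-bound decode estimate: each generated token reads roughly
# all weights once, so tok/s ~= memory_bandwidth / weight_bytes.
# The 100 GB/s bandwidth, 7B size, and bits/weight values are assumptions.

def est_tok_per_s(params_billion: float, bits_per_weight: float,
                  bandwidth_gb_s: float) -> float:
    weight_gb = params_billion * bits_per_weight / 8  # GB of weights read/token
    return bandwidth_gb_s / weight_gb

for label, bits in (("Q4", 4.5), ("Q8", 8.5), ("FP16", 16.0)):
    print(f"{label:>4}: ~{est_tok_per_s(7, bits, 100.0):.1f} tok/s upper bound")
```

This is an upper bound, not a prediction; real runs fall short of it, which is exactly why the signed per-quant rows on /hw/x are the numbers to trust.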