Hardware FAQ
Pentest-Bench — frequently asked questions
Direct answers to the questions local-LLM enthusiasts ask about Pentest-Bench, drawn from 12 signed runs on llm-speed. Every numerical claim links to a verifiable run permalink at /r/<id>.
What's the fastest LLM on Pentest-Bench?
The fastest measured configuration on Pentest-Bench on llm-speed runs at 42.0 decode tok/s on llama.cpp b9999 (workload chat-short, run r_a3ei8og3rkg); the model for that run is listed on the run page. Cite as https://llm-speed.com/r/r_a3ei8og3rkg.
This is the headline decode tokens-per-second across every (model, backend) pairing submitted for Pentest-Bench; faster results may exist in configurations not yet benchmarked on this rig, but among signed runs this is the published top.
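For readers who want to reproduce the headline metric themselves, decode tok/s here is taken to mean generated tokens divided by generation wall time; the sketch below shows that arithmetic under that assumption. The generate_fn callable is a hypothetical stand-in for any local inference call, not part of llm-speed, and a real harness would time prompt prefill separately.

    import time

    def decode_tok_per_s(generate_fn, prompt, max_new_tokens):
        # generate_fn is a hypothetical stand-in for a local inference call
        # that returns the number of tokens it actually generated.
        start = time.perf_counter()
        n_generated = generate_fn(prompt, max_new_tokens)
        elapsed = time.perf_counter() - start
        # Decode rate = generated tokens / wall-clock seconds spent generating.
        # Prefill time is not separated out in this sketch.
        return n_generated / elapsed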
Can I run a 7B-class model on Pentest-Bench?
No 7B-class model run on Pentest-Bench has been submitted to llm-speed yet, so the canonical "yes/no with a measured tok/s" answer is not currently published.
Pentest-Bench should fit a 7B model at 4-bit quantization; submit a run with "llm-speed bench --models <hf-id>" to populate this answer.
Can I run a 30B-class model on Pentest-Bench?
No 30B-class model run on Pentest-Bench has been submitted to llm-speed yet, so the canonical "yes/no with a measured tok/s" answer is not currently published.
Pentest-Bench should fit a 30B model at 4-bit quantization; submit a run with "llm-speed bench --models <hf-id>" to populate this answer.
Can I run a 70B-class model on Pentest-Bench?
No 70B-class model run on Pentest-Bench has been submitted to llm-speed yet, so the canonical "yes/no with a measured tok/s" answer is not currently published.
Pentest-Bench should fit a 70B model at 4-bit quantization; submit a run with "llm-speed bench --models <hf-id>" to populate this answer.
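The "should fit at 4-bit quantization" statements in the three answers above are capacity estimates, not measured runs. A back-of-the-envelope sketch of the arithmetic behind them, assuming roughly 4.5 bits per weight for a typical 4-bit GGUF quant and counting weights only; both figures are assumptions, not llm-speed data, and KV cache plus runtime overhead come on top.

    def approx_weight_gib(params_billion, bits_per_weight=4.5):
        # Raw weight footprint only; KV cache, activations, and runtime
        # overhead add several GiB on top of this estimate.
        return params_billion * 1e9 * bits_per_weight / 8 / 2**30

    for size_b in (7, 30, 70):
        print(f"{size_b}B ~ {approx_weight_gib(size_b):.1f} GiB of weights at ~4-bit")
    # Prints roughly 3.7 GiB, 15.7 GiB, and 36.7 GiB respectively.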
How does Pentest-Bench compare to RTX 5090 for local LLM inference?
For a head-to-head between Pentest-Bench and RTX 5090, see the side-by-side comparison page at https://llm-speed.com/vs/pentest-bench-vs-rtx-5090, which lays out every (model, backend) pair where both rigs have a signed run.
As a single-rig anchor, Pentest-Bench tops out at 42.0 decode tok/s via llama.cpp (run r_a3ei8og3rkg); the RTX 5090 top number is on its own /hw/<slug> page, so the comparison stays grounded in measured numbers, not extrapolation.
Pentest-Bench vs RTX 5090 (side-by-side) · RTX 5090 leaderboard · Pentest-Bench leaderboard
Which backend is fastest on Pentest-Bench?
The only backend with a signed local run on Pentest-Bench so far is llama.cpp, with a top result of 42.0 decode tok/s (run r_a3ei8og3rkg); a multi-backend comparison on Pentest-Bench is not yet published.
Submit a competing run with "llm-speed bench --backends <other-backend>" on a Pentest-Bench machine to populate the comparison.
Is Pentest-Bench worth it for local LLM inference in 2026?
On signed data, Pentest-Bench delivers up to 42.0 decode tok/s via llama.cpp (run r_a3ei8og3rkg), which puts its published top configuration above conversational-reading speed (~20 tok/s) but below interactive-coding thresholds.
"Worth it" depends on your model class: Pentest-Bench is most useful for 7B-class models and smaller; larger models will need quantization or a bigger rig.
For "what fits and how fast", the per-model rows on /hw/pentest-bench are the honest answer; for cross-rig comparisons, see /vs/pentest-bench-vs-rtx-5090.
What quantization should I use on Pentest-Bench?
On Pentest-Bench, the only quant with a signed local run is Q4_K_M, at 42.0 decode tok/s (run r_a3ei8og3rkg); a multi-quant comparison on Pentest-Bench is not yet published.
Quant choice is a quality-vs-speed tradeoff that this hardware FAQ does not arbitrate; llm-speed publishes hardware-side speed, not output quality. For quality scores, see the model card on Hugging Face and the LMSYS Chatbot Arena.
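Quality aside, a capacity-first way to narrow the choice is to compute the largest bits-per-weight that fits a given memory budget. The sketch below uses the same weights-only assumption as the capacity estimate earlier in this FAQ; the 24 GiB budget in the example is hypothetical, since this page does not state Pentest-Bench's memory.

    def max_bits_per_weight(params_billion, budget_gib):
        # Upper bound: treats weights as the only memory consumer, so leave
        # headroom for KV cache and runtime overhead in practice.
        return budget_gib * 2**30 * 8 / (params_billion * 1e9)

    # Hypothetical 24 GiB budget, 30B-class model: ~6.9 bits/weight at most.
    print(f"{max_bits_per_weight(30, 24):.1f} bits/weight")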