llm-speed

Community-claimed

LLM tok/s folklore

2176 claims pulled from r/LocalLLaMA and Hacker News, covering 36 models on 50 rigs. Each row links to the original comment so you can verify it yourself. These are not signed runs; the confidence value is the regex extractor's best guess, based on how close the model and hardware mentions sit to the extracted tok/s figure.

Trust level: community-reported. Numbers come from natural-language posts; the regex extractor pairs each tok/s value with the nearest model and hardware mention, but it can be thrown off by long threads and dense benchmark tables. For the canonical, signed leaderboard, see the main leaderboard.
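To make the confidence field concrete, here is a minimal sketch of how a proximity-based score could be computed. The base score, weights, and distance cap are illustrative assumptions, not llm-speed's actual heuristic.

    # Hypothetical proximity-based confidence score: a claim is trusted more when
    # the model and hardware mentions sit close to the tok/s value in the post.
    # Base score, weights, and distance cap are assumptions for illustration only.

    def proximity_confidence(tok_pos, model_pos=None, hw_pos=None, max_dist=400):
        """Return a score in [0, 1] given character offsets within the post text."""
        score = 0.4  # base credit for finding a parseable tok/s value at all
        for pos, weight in ((model_pos, 0.35), (hw_pos, 0.25)):
            if pos is not None:
                dist = abs(tok_pos - pos)
                score += weight * max(0.0, 1.0 - dist / max_dist)
        return round(min(score, 1.0), 2)

    # A model named 20 characters away and hardware 300 characters away
    print(proximity_confidence(tok_pos=500, model_pos=480, hw_pos=800))  # about 0.8

Under a heuristic like this, the 45-75% values below would simply reflect how tightly the model and hardware names cluster around each extracted number.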

By hardware

OpenRouter

  • communityconfidence 60%

    3,000.0tok/s gpt-oss-120b on OpenRouter via hosted-api

    -120b on openrouter.. it has NINETEEN providers configured ranging from 25 TPS to 805 TPS !! That's measured, not claimed! (they \*claim\* 3000tps lol, add \*that\* to your chart.)

    source: Reddit · u/MaxPhoenix_ · 2026-02-14

  • communityconfidence 60%

    2,000.0tok/s Qwen 3 Coder on OpenRouter via hosted-api

    prised the industry hasn't been as impressed with their benchmarks on token throughput. We're using the Qwen 3 Coder 480b model and seeing ~2000 tokens/second, which is easily 10-20x faster then most LLM models on the market. Even some of the fastest models still only achieve 100…

    source: HN · u/rbitar · 2025-10-01

  • communityconfidence 55%

    330.0tok/s on OpenRouter via hosted-api AWQ

    70.2% Setup: RTX 5090, 32GB VRAM. Qwen runs at \~50 tok/s per request, median text is 87 tokens so \~1.8s/text. Aggregate throughput \~200-330 tok/s at c=16-32. Gemma 4 26B on vLLM was too slow for production, Triton problem most probably — ended up using OpenRouter for it and …

    source: Reddit · u/Interesting_Fly_6576 · 2026-04-07

Cerebras

  • communityconfidence 60%

    2,000.0tok/s Llama3.3 70B on Cerebras via hosted-api

    check [https://inference.cerebras.ai/](https://inference.cerebras.ai/) \- currently they are the only provider who provides up to 2000 tokens/s for Llama3.3 70B. LLama3 405B runs at \~1000 tokens/s - [https://cerebras.ai/blog/llama-405b-inference](https://cerebras.ai/blog/llama…

    source: Reddit · u/MLDataScientist · 2025-01-02

  • communityconfidence 60%

    2,000.0tok/s Qwen3-Coder on Cerebras via hosted-api

    er-480B <i>hosted by Cerebras</i> is $2/Mtok (both input and output) through OpenRouter.<p>OpenRouter claims Cerebras is providing at least 2000 tokens per second, which would be around 10x as fast, and the feedback I'm seeing from independent benchmarks indicates that Qwen3-Code…

    source: HN · u/coder543 · 2025-08-29

  • communityconfidence 60%

    1,500.0tok/s Qwen3-235B on Cerebras via hosted-api

    nches Qwen3-235B, achieving 1.5k tokens per second So, does that mean that in general for the most modern high end LLM tools, to generate ~1500 tokens per seconds you need around $500k in hardware?<p>Checking: Anthropic charges $70 per 1 million output tokens. @1500 tokens per s…

    source: HN · u/stingraycharles · 2025-07-23

  • communityconfidence 60%

    1,000.0tok/s gpt-oss-120b on Cerebras via hosted-api

    Cerebras Code now supports GLM 4.6 at 1000 tokens/sec > but we literally have no way to prove they're right<p>Of course we do. Just run a benchmark with Cerebras/Groq and compare to the result

    source: HN · u/threeducks · 2025-11-08

  • communityconfidence 60%

    969.0tok/s Llama 3.1 405B on Cerebras via hosted-api

    Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference - Cerebras [link: https://cerebras.ai/blog/llama-405b-inference]

    source: Reddit · u/badgerfish2021 · 2024-11-19

  • communityconfidence 60%

    808.0tok/s Qwen3 32B on Cerebras via hosted-api

    ut gaps Hardware optimization creates wild performance swings:<p>Cerebras with Qwen3 32B at 2,496 tok/s for $0.50/1M and Llama 4 Scout at 2,808 tok/s for $0.70/1M.<p>Compare to the same models elsewhere: Often stuck at 40-80 tok/s for similar or higher prices.<p>Arbitrage alert: …

    source: HN · u/demian101 · 2025-07-17

  • communityconfidence 60%

    800.0tok/s Llama 3.1-8B on Cerebras via hosted-api

    Launches the World’s Fastest AI Inference Cerebras Inference is available to users today! **Performance:** Cerebras inference delivers 1,800 tokens/sec for Llama 3.1-8B and 450 tokens/sec for Llama 3.1-70B. According to industry benchmarking firm Artificial Analysis, Cerebras I…

    source: Reddit · u/CS-fan-101 · 2024-08-27

  • communityconfidence 60%

    560.0tok/s Llama3.1-70B on Cerebras via hosted-api

    redd.it/v2186rtvyuqd1.png?width=1618&format=png&auto=webp&s=e259568d90417ddddc60800244b0dde3fb806709 Llama3.1-8B: 2,010 T/s Llama3.1-70B: 560 T/s

    source: Reddit · u/Barry_Jumps · 2024-09-25

Groq

  • communityconfidence 60%

    1,665.0tok/s Llama 3.3 70B on Groq via hosted-api

    85 tps for Llama 3.3 70B using [LPUs](https://groq.com/the-groq-lpu-explained/) Edit, Groq also mentions that they have achieved speeds of 1665 tps (!) with speculative decoding and LPUs as well ([source](https://groq.com/groq-first-generation-14nm-chip-just-got-a-6x-speed-boost…

    source: Reddit · u/TechNerd10191 · 2025-04-22

  • communityconfidence 60%

    627.0tok/s Qwen3 32B on Groq via hosted-api

    1M (103 tok/s) and Qwen3 235B at 0.56ms for $0.30/1M (50 tok/s).<p>Groq takes it further with models like Qwen3 32B at 0.14ms for $0.36/1M (627 tok/s).<p>Arbitrage alert: These blow away slower "enterprise" options costing 10x more, ideal for real-time apps<p>3. speed demons with…

    source: HN · u/demian101 · 2025-07-17

  • communityconfidence 60%

    200.0tok/s Llama 3 8B on Groq via hosted-api

    Groq surpasses 1,200 tokens/sec with Llama 3 8B When reading Hacker News you develop a signal/noise filter, where lots of headlines make bold claims but you filter them o

    source: HN · u/andy_xor_andrew · 2024-05-30

  • communityconfidence 60%

    175.0tok/s Mistral Small on Groq via hosted-api

    input and $0.39/M output. On that much memory, you can have at least 2 instances of QwQ with a lot of context that would maybe put out 150-175 t/s with concurrency which is around $5 a day. If you are using things like Mistral Small 3.1 or Gemma 3 27, it would be quite a bit more…

    source: Reddit · u/dionysio211 · 2025-04-17

RTX 5060 Ti

  • communityconfidence 55%

    1,200.0tok/s on RTX 5060 Ti IQ3_XXS

    total 32G vram) Contexto 130k, kv cache f16, llama.ccp Q5_K_XL - pp ~650 t/s, tg ~16 t/s Q4_K_M - pp ~800 t/s, tg ~19 t/s IQ3_XXS - pp ~1200 t/s, tg ~25 t/s

    source: Reddit · u/Embarrassed-Boot5193 · 2026-03-11

RTX 4090

  • communityconfidence 55%

    949.5tok/s deepseek-r1 on RTX 4090 awq

    2x 4090 - 32b - 949.5 tokens/s with 64 cocurrent reqs (aphrodite engine deepseek-r1-distill-qwen-32b-awq --kv-cache-dtype fp8)

    source: Reddit · u/Due_Car8412 · 2025-01-21

  • communityconfidence 70%

    665.2tok/s Qwen3-Coder-Next on RTX 4090 via llama.cpp

    with 256k context, llama.cpp autoparser branch, windows 11. prompt eval time = 50498.86 ms / 33594 tokens ( 1.50 ms per token, 665.24 tokens per second) eval time = 2282.98 ms / 82 tokens ( 27.84 ms per token, 35.92 tokens per second) t…

    source: Reddit · u/gtrak · 2026-02-18

  • communityconfidence 75%

    548.9tok/s Llama-3.1-70B on RTX 4090 via vllm FP8

    neuralmagic\_Meta-Llama-3.1-70B-Instruct-FP8-dynamic Avg generation throughput: \~29-30 tokens/s Avg prompt throughput: 548.9 tokens/s (4 GPUs, 4090, power limited to 325) (8x, 8x, 4x , 4x) 5950x taichi x570 vllm backend i didn't do the specific prompt to get the va

    source: Reddit · u/I_can_see_threw_time · 2024-08-01

  • communityconfidence 50%

    400.0tok/s on RTX 4090 via llama.cpp

    up**: * RTX 4090 workstation running llama.cpp * OpenCode on my MacBook * 4-bit quantized model, 64K context size, \~22GB VRAM usage * \~2,400 tok/s prefill, \~40 tok/s generation Based on my testing: * It works surprisingly well and makes correct tool calling for tasks like w…

    source: Reddit · u/garg-aayush · 2026-03-30

unknown

  • communityconfidence 45%

    787.0tok/s Mistral Small Q8_0

    * | - Size: 15.58 GiB / 26.90B params — fully on GPU, zero CPU offload - pp plateaus at 1024 (compute saturation at ~787 t/s) - Effective memory bandwidth utilization: ~304 GB/s (~51% of 600 GB/s theoretical) **vs Vulkan (same model):**

    source: Reddit · u/Icy_Gur6890 · 2026-04-09

  • communityconfidence 55%

    768.0tok/s Qwen3 Coder via llama.cpp AWQ

    oughput:** **1,218 tokens/s** # llama.cpp * **Quant:** `_unsloth/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf_` * **Throughput:** **768 tokens/s**

    source: Reddit · u/Holiday_Purpose_3166 · 2025-12-13

  • communityconfidence 45%

    711.0tok/s Qwen3-8B Q8_0

    Run with --flash-attn ``` **Dense model (Qwen3-8B Q8_0) — prompt processing:** - ROCm default, no flash attn: **711 t/s** - ROCm + flash attn only: **~3,980 t/s** - **5.5× improvement from one flag** --- ## …

    source: Reddit · u/Important_Quote_1180 · 2026-03-27

  • communityconfidence 55%

    669.0tok/s gpt-oss-20b via vllm Q3_K_M

    32B-Instruct-GGUF_Qwen2.5-VL-32B-Instruct-Q4_K_M.gguf`|18.48 GB|626 t/s|**22.95 t/s**| |`DeepSeek-R1-Distill-Qwen-32B-Q3_K_M.gguf`|14.84 GB|669 t/s|**24.12 t/s**| |`gpt-oss-20b-F16.gguf`|12.83 GB|2620 t/s|**87.09 t/s**| |`ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf`|11.27 GB…

    source: Reddit · u/reujea0 · 2026-01-01

  • communityconfidence 65%

    652.0tok/s Qwen3-30B-A3B via vllm FP8

    o 6000 blackwell with vllm Power limit set to 450w **Short Context (1K tokens):** * Single user: 88.4 tok/s * 10 concurrent users: **652 tok/s** throughput * Latency: 5.65s → 7.65s (1→10 users) **Long Context (256K tokens):** * Single user: 22.0 tok/s * 10 concurrent use…

    source: Reddit · u/notaDestroyer · 2025-10-16

  • communityconfidence 55%

    626.0tok/s DeepSeek-R1 via vllm Q4_K_M

    -2506-UD-Q5_K_XL.gguf`|15.63 GB|861 t/s|**32.20 t/s**| |`unsloth_Qwen2.5-VL-32B-Instruct-GGUF_Qwen2.5-VL-32B-Instruct-Q4_K_M.gguf`|18.48 GB|626 t/s|**22.95 t/s**| |`DeepSeek-R1-Distill-Qwen-32B-Q3_K_M.gguf`|14.84 GB|669 t/s|**24.12 t/s**| |`gpt-oss-20b-F16.gguf`|12.83 GB|2620 t/s…

    source: Reddit · u/reujea0 · 2026-01-01

  • communityconfidence 45%

    612.0tok/s Llama-3.1-8B BF16

    -----|------|--------|---------|----------|------|---------|----------|------| | Llama-3.1-8B-Instruct-BF16.gguf | 16GB | 1:1 | 2,279 t/s | 612 t/s | -73% | 12.61 t/s | **18.82 t/s** | +49% | | Mistral-Small-3.2-24B-Instruct-2506-UD-Q5_K_XL.gguf | 17GB | 1:1 | 1,658 t/s | 404 t/s…

    source: Reddit · u/reujea0 · 2026-01-10

  • communityconfidence 45%

    521.7tok/s Mixtral 8x7b Q5_K_M

    mings: load time = 28128.85 ms llama_print_timings: sample time = 492.61 ms / 257 runs ( 1.92 ms per token, 521.71 tokens per second) llama_print_timings: prompt eval time = 5323.88 ms / 61 tokens ( 87.28 ms per token, 11.46 tokens …

    source: Reddit · u/redditaccountno6 · 2024-05-04

  • communityconfidence 50%

    470.0tok/s Qwen3-32B via vllm

    I just ran 2 cards directly from my motherboard's PCIE slots, pcie4.0 at 8x and I saw PP to double. E.g. Qwen3-32B PP went from 230t/s to 470t/s.

    source: Reddit · u/MLDataScientist · 2025-07-22

  • communityconfidence 55%

    393.6tok/s Deepseek V3 via vllm Q4

    (120k tokens): • Successful Requests: 1000 • Benchmark Duration: 439.96 seconds • Request Throughput: 2.27 req/s • Output Token Throughput: 393.61 tok/s • Total Token Throughput: 921.91 tok/s • Mean TTFT: 23549.66 ms • Mean TPOT: 700.44 ms Full benchmarks are available here: Sub…

    source: Reddit · u/Ok-Perception2973 · 2025-01-06

  • communityconfidence 50%

    360.0tok/s gpt-oss-120b via llama.cpp

    on large contexts. Not as bad on ROCm compared to Vulkan, but still significant, like prompt processing goes from \~900 t/s on 0 context to 360 t/s on 32K context for gpt-oss-120b (that was before recent performance degradation). For comparison, my DGX Spark suffers a bit less fr…

    source: Reddit · u/Eugr · 2025-12-15

  • communityconfidence 55%

    212.0tok/s Llama-3.2-1B via llama.cpp Int4

    SNR (44.8 → 19.7 dB). Dead end. * Cascade offload: ruled out by AMD docs. * Speculative decoding with Llama-3.2-1B draft (44% accept rate, 212 t/s draft): **zero effective gain**. Spec decoding not helping is the interesting one, normally a 44% accept rate would buy you somethin…

    source: Reddit · u/brandedtamarasu · 2026-03-26

  • communityconfidence 55%

    184.6tok/s phi4 via ollama fp16

    s|98.90 tokens/s| |fp16|55.584083ms|48.72 tokens/s| **gemma3:1b** |Quantization|Load Duration|Inference Speed| |:-|:-|:-| |q4|55.116083ms|184.62 tokens/s| |fp16|55.034792ms|135.31 tokens/s| **phi4:14b** |Quantization|Load Duration|Inference Speed| |:-|:-|:-| |q4|25.423792ms|3…

    source: Reddit · u/purealgo · 2025-03-12

RTX 5090

  • communityconfidence 55%

    765.5tok/s DeepSeek V3 on RTX 5090 IQ4_XS

    X8 5.0, 4090s, A6000, A40 at X4 4.0 and 3090 at X1 3.0: 665.46 t/s PP, 25.90 t/s TG * 5090s at X8/X8 5.0, 4090s, and Ampere trio at X4 4.0: 765.51 t/s PP, 26.18 t/s TG. * 5090(1) at X16 5.0, 5090(2) at X4 5.0, all the rest at X4 4.0: 940 t/s PP, 26.75 t/s TG. * 5090s at X16 5.0, …

    source: Reddit · u/panchovix · 2026-01-16

  • communityconfidence 55%

    630.0tok/s on RTX 5090 via vllm BF16

    nchmarks & 3 Critical Fixes for Reasoning Models Benchmarks (BF16, no quantization): \- Single: \~83 tok/s \- Batched (10 concurrent): \~630 tok/s \- TTFT: 45–60ms \- VRAM: 30.6 / 32 GB Things that bit me: \- The HuggingFace reasoning parser plugin has broken imports on vL…

    source: Reddit · u/Impressive_Tower_550 · 2026-03-15

  • communityconfidence 75%

    555.8tok/s Qwen3-Coder on RTX 5090 via sglang AWQ

    atency? # 1. Choosing the Framework **RTX 5090 — Qwen3-Coder-30B-A3B-Instruct-AWQ** |Metric|vLLM|SGLang| |:-|:-|:-| |Output throughput|**555.82 tok/s**|207.93 tok/s| |Mean TTFT|**549 ms**|1,558 ms| |Median TPOT|**7.06 ms**|18.84 ms| vLLM wins by 2.7x. SGLang is required `--qu…

    source: Reddit · u/NoVibeCoding · 2026-03-06

  • communityconfidence 60%

    400.0tok/s GPT-OSS-120b on RTX 5090 via llama.cpp

    wer, and use llama.cpp to run those same models the GPT-OSS-120b model has to offload 24 layers the the CPU using --n-cpu-moe 24, and gets ~400 t/s pp and 21 t/s generation. Which is not bad considering the load I am putting on a consumer grade memory system. GPT-OSS-20b is an…

    source: Reddit · u/unrulywind · 2025-08-23

  • communityconfidence 60%

    400.0tok/s gpt-oss-20b on RTX 5090 via vllm

    UDA\_ARCH=120a / Tried VLLM also * Model: gpt-oss-20b MXFP4 (GGUF) / Safetensors **The Problem:** * First 1-2 inferences work fine (\~300-400 tok/s) * On the 3rd or 4th inference, GPU hangs with "CUDA error: illegal memory access" * GPU enters error state, fans spin to 100% * n…

    source: Reddit · u/Think_Illustrator188 · 2025-12-28

H200

  • communityconfidence 75%

    712.0tok/s Qwen2.5-32B on H200 via vllm FP16

    p dive on vLLM quantization. Tested AWQ, GPTQ, Marlin, GGUF, and BitsandBytes on Qwen2.5-32B using an H200. Stuff we found: * Marlin hits 712 tok/s, baseline FP16 does 461. Quantized and faster. * GPTQ without Marlin kernel is actually slower than FP16 (276 tok/s) * BitsandByte…

    source: Reddit · u/LayerHot · 2026-01-09

  • communityconfidence 50%

    398.0tok/s on H200 via vllm

    set. \- Single user: 207 tok/s, 35ms TTFT \- At 32 concurrent users: 2,267 tok/s, 85ms TTFT \- Peak throughput (no concurrency limit): 4,398 tok/s All of the benchmarks were done with [vLLM benchmark CLI](https://docs.vllm.ai/en/latest/benchmarking/cli/) Full numbers: …

    source: Reddit · u/LayerHot · 2026-01-20

A100

  • communityconfidence 75%

    700.0tok/s LLaMA-3 70B on A100 via vllm FP16

    Decoding speed is nuts—up to 1.8x more requests/sec than vLLM on an A100. TurboMind pushes 4-bit inference 2.4x faster than FP16, hitting \~700 tokens/sec at 100 users (LLaMA-3 70B Q4). Low latency (40-60ms), easy one-command server setup, and it even handles multi-round chats ef…

    source: Reddit · u/juliensalinas · 2025-04-08

RTX 3090

  • communityconfidence 60%

    700.0tok/s Qwen3 30B-A3B on RTX 3090 via sglang

    it. I was also really excited nearly a year ago for Qwen3 30B-A3B, and I had gotten it running quite fast on my 3090s (150tok/s single and 700tok/s batched per 3090, though i hadn't tested long context) and then I abjectly failed to come up with a use case for it, I acquired a 50…

    source: Reddit · u/michaelsoft__binbows · 2026-02-28

  • communityconfidence 45%

    693.2tok/s on RTX 3090 fp8

    90s Speed : 79.4 tok/s \--- Concurrent Generation (n=20) --- Total tokens : 20480 Wall time : 29.54s Throughput : 693.2 tok/s (aggregate) \--- Prefill / TTFT (target \~8000 input tokens) --- Input : 15381 tokens (from server) TTFT : 7053 …

    source: Reddit · u/Conscious_Cut_6144 · 2026-03-11

  • communityconfidence 70%

    600.0tok/s Llama 3.3 70b on RTX 3090 via exllamav2

    I run 2*3090 on ExLlamaV2 for Llama 3.3 70b at 4.5bpw with 32k context with tensor parallel at 600tok/s prompt ingestion and 30tok/s for generation, all for $1.5k thanks to ebay deals. Heck you can speed things even more with 4.0bpw + speculat

    source: Reddit · u/TyraVex · 2025-02-16

  • communityconfidence 60%

    399.1tok/s deepseek-r1 on RTX 3090 via ollama

    ----------------------------------- Model: deepseek-r1:32b Performance Metrics: Prompt Processing: 399.05 tokens/sec Generation Speed: 27.18 tokens/sec Combined Speed: 27.58 tokens/sec …

    source: Reddit · u/ChemicalExcellent463 · 2025-02-17

  • communityconfidence 60%

    187.0tok/s Qwen3-Coder-Next on RTX 3090 via ollama

    Next UD-Q4\_K\_XS via llama-server (Qwen3.5 under test as I type this) \- Layer split across both GPUs (PHB interconnect, no NVLink) \- \~187 tok/s prompt processing, \~81 tok/s generation The agent talks to any OpenAI-compatible endpoint, so it works with llama-server, Ollama…

    source: Reddit · u/fabiononato · 2026-03-07

B200

  • communityconfidence 55%

    651.7tok/s Qwen3-Coder on B200 FP8

    he largest gap on the most communication-heavy workload – **GLM-4.6-FP8 (8-way TP):** B200 is **4.87x** faster than PRO 6000 (8,036.71 vs 1,651.67 tok/s) – **Qwen3-Coder-480B (4-way TP):** B200 is **4.02x** faster than PRO 6000 (6,438.43 vs 1,602.96 tok/s) – **GLM-4.5-Air (single…

    source: Reddit · u/NoVibeCoding · 2026-02-11

SambaNova

  • communityconfidence 60%

    600.0tok/s Qwen 2.5 32b on SambaNova via hosted-api

    aNova, Cerebras and Grok have hardware that may be able to run it. Outside of that, don't think so. SambaNova has Qwen 2.5 32b Coder at 400-600 t/s.

    source: Reddit · u/FullOf_Bad_Ideas · 2025-01-02

Together AI

  • communityconfidence 60%

    588.0tok/s GPT-OSS-120B on Together AI via hosted-api

    fected regimes. We wrote up the details at cumulus.blog/ionattention.<p>On multimodal pipelines we get better performance than big players (588 tok/s vs. Together AI's 298 on the same VLM workload). We're honest that p50 latency is currently worse (~1.46s vs. 0.74s) — that's the …

    source: HN · u/vshah1016 · 2026-03-12

  • communityconfidence 70%

    200.0tok/s Gemma3-27b on Together AI via hosted-api

    s rocking for a laptop. 5k in 32 sec. is about 150 t/sec. I got an AMD guy to run some numbers on their new unified chip and he was showing 200 t/sec, but with a smaller 7b model. A larger model would have surely slowed him down. I run Gemma3-27b in IQ4-XS on an RTX 4770 ti a…

    source: Reddit · u/unrulywind · 2025-04-11

RTX 3090 Ti

  • communityconfidence 60%

    500.0tok/s Llama 3 8B on RTX 3090 Ti via vllm

    10 mins I can go through 24.9M prompt tokens and 1.69M generated tokens on 2x rtx 3090 Ti and Llama 3 8B W8A8 in vLLM 0.8.3 V1, so about 41,500 t/s input and 2800 t/s output, both at the same time. I went through like 8B+ tokens on single card this way before I put in the second …

    source: Reddit · u/FullOf_Bad_Ideas · 2025-04-16

H100

  • communityconfidence 60%

    400.0tok/s Llama 3.1 8B on H100 via sglang

    raining to get EAGLE draft models - if you're optimizing a production workload it will probably be worth it tough - the EAGLE team reported 400 TPS for a Llama 3.1 8B model on a single H100, that's bonkers: [https://x.com/hongyangzh/status/1903109123895341536](https://x.com/hongy…

    source: Reddit · u/randomfoo2 · 2025-03-25

  • communityconfidence 60%

    170.0tok/s Mistral 7B on H100 via llama.cpp

    /www.baseten.co/blog/benchmarking-fast-mistral-7b-inf...</a> benchmarks an H100 on Mistral 7B, and claims different providers range from 38-170 TPS depending on their proprietary configurations.<p>A used H100 server is roughly $80k according to google. H100 power draw alone is 70…

    source: HN · u/vessenes · 2024-05-20

MI300X

  • communityconfidence 45%

    370.0tok/s on MI300X FP16

    on) vs H100 * Bandwidth: 5.2 TB/s (faster than your desk llama can spit) H100: * Price: $28,000 (approximately one kidney) * Performance: 370 tokens/s/GPU (FP16), but it doesn't fit into one. * Memory: 80GB (MI300X has almost 2.5 times more VRAM!!) Key Points: 1. H100 is \~4.…

    source: Reddit · u/No_Training9444 · 2024-06-25

  • communityconfidence 55%

    353.0tok/s llama-2-70b on MI300X FP16

    85c28e456902a874152d145e2ea3f4d28f84ebf8 After some calculations: MI300X: * Price: $15,000 (or 1.5 million alpaca tokens) * Performance: 353 tokens/s/GPU (FP16) * Memory: 192GB HBM3 (that's a lot of context for your LLM to chew on) vs H100 * Bandwidth: 5.2 TB/s (faster than yo…

    source: Reddit · u/No_Training9444 · 2024-06-25

RTX 5060

  • communityconfidence 55%

    292.0tok/s on RTX 5060 via vllm FP8

    zation-goal balanced # Output tells you: # - Recommended precision: FP8 # - Batch size: 128 # - Expected throughput: 6,292 tokens/sec # - Memory usage: 15.2GB / 24GB # - Plus full vLLM config you can copy-paste **Validation:** Tested on my RTX …

    source: Reddit · u/1Hesham · 2025-11-18

Replicate

  • communityconfidence 60%

    264.0tok/s Qwen3 30B-A3B on Replicate via hosted-api

    it's now 605 t/s. Similarly, Llama 4 Scout has a pp512 of 103 t/s, and is now 173 t/s, although the HIP backend is significantly faster at 264 t/s. Unlike last time, I won't be taking any model testing requests as these sweeps take quite a while to run - I feel like there are en…

    source: Reddit · u/randomfoo2 · 2025-07-22

Radeon RX 9070 XT

  • communityconfidence 55%

    200.0tok/s on Radeon RX 9070 XT via llama.cpp FP16

    ~778 TFLOPS FP4 Compute, and I had 41 TFLOPS FP16 Compute, running Qwen 3.5 27b at similar quantizations (Q4 something vs. IQ3-XXS) he had ~200 TPS PP and ~20 TPS TG to memory, which was near identical to what I get on llama.cpp Vulkan inference of Qwen 3.5 27b IQ3-XXS. Intuiti…

    source: Reddit · u/Qwen30bEnjoyer · 2026-04-02

  • communityconfidence 60%

    180.0tok/s gpt oss 20b on Radeon RX 9070 XT via llama.cpp

    hanics) gpt OSs 20v won by a metric mile. It did run faster than gpt OSs though on llama.cpp (9070xt) with gpt OSs at 140 tok/s and lfm at 180 tok/s

    source: Reddit · u/ConversationOver9445 · 2026-02-25

M4

  • communityconfidence 60%

    180.0tok/s Qwen3-32B on M4 via lm-studio

    its 79.51% while sustaining ~64 tok/s - arguably today's best speed/quality trade-off for Mac setups. 5️⃣ The 0.6B micro-model races above 180 tok/s but tops out at 37.56% - that's why it's not even on the graph (50 % performance cut-off). All local runs were done with @lmstudi…

    source: Reddit · u/ResearchCrafty1804 · 2025-05-07

How this list is built

A daily crawler queries r/LocalLLaMA, r/Oobabooga, r/Ollama, r/MachineLearning, and HN for posts that mention tok/s figures alongside GPU or Apple-silicon hardware. A regex extractor finds every (model × hardware × tok/s) tuple by nearest-neighbor matching within a paragraph; sanity filters then drop obvious prompt-eval numbers, hardware-class outliers, and low-confidence parses.
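As a rough sketch of that nearest-neighbor matching, assuming simplified patterns: the regexes, the prompt-eval ("pp") filter, and the sample text below are illustrative, not the pipeline's real code.

    import re

    # Rough sketch: pair each tok/s value with the nearest model and hardware
    # mention in the same paragraph. Patterns and filters are assumptions.

    TOKS  = re.compile(r"(\d[\d,]*\.?\d*)\s*(?:tok(?:ens)?/s(?:ec)?|t/s|tps)", re.I)
    MODEL = re.compile(r"\b(llama[\w.\- ]*|qwen[\w.\- ]*|gpt-oss[\w\-]*|mistral[\w.\- ]*)", re.I)
    HW    = re.compile(r"\b(rtx \d{4}\s?(?:ti)?|h100|h200|a100|b200|mi300x)\b", re.I)

    def extract_claims(paragraph):
        """Yield {model, hardware, tok_s} dicts from one paragraph of a post."""
        models   = [(m.start(), m.group(1)) for m in MODEL.finditer(paragraph)]
        hardware = [(m.start(), m.group(1)) for m in HW.finditer(paragraph)]
        for m in TOKS.finditer(paragraph):
            # crude sanity filter: skip values that look like prompt-processing speeds
            if "pp" in paragraph[max(0, m.start() - 12):m.start()].lower():
                continue
            def nearest(cands):
                return min(cands, key=lambda c: abs(c[0] - m.start()))[1] if cands else None
            yield {"model": nearest(models),
                   "hardware": nearest(hardware),
                   "tok_s": float(m.group(1).replace(",", ""))}

    text = "2x RTX 4090, Qwen3 32B AWQ: pp ~800 t/s, tg ~19 t/s"
    print(list(extract_claims(text)))
    # [{'model': 'Qwen3 32B AWQ', 'hardware': 'RTX 4090', 'tok_s': 19.0}]

Real posts are messier, with multi-model threads and benchmark tables, which is where the mis-attributions the trust note warns about come from.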

Folklore is shown alongside signed runs on model and hardware pages but never replaces them. If you have a signed run for a config surfaced here, your number wins. Submit one with pipx install llm-speed && llm-speed bench.

Total community claims indexed: 2176.