llm-speed

Community-claimed

LLM tok/s folklore

2176 claims pulled from r/LocalLLaMA and Hacker News, covering 36 models on 50 rigs. Each row links to the original comment so you can verify it yourself. These are not signed runs; the confidence value is the regex extractor's best guess, based on how close the model and hardware mentions sit to the extracted tok/s figure.

Trust level: community-reported. Numbers come from natural-language posts; the regex extractor pairs each tok/s value with the nearest model and hardware mention, but it can be thrown off by long threads and dense benchmark tables. For the canonical, signed leaderboard, see the main leaderboard.
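To make the confidence field concrete, here is a minimal sketch of how a proximity-based score could be computed. The base score, weights, and distance cap are illustrative assumptions, not llm-speed's actual heuristic.

    # Hypothetical proximity-based confidence score: a claim is trusted more when
    # the model and hardware mentions sit close to the tok/s value in the post.
    # Base score, weights, and distance cap are assumptions for illustration only.

    def proximity_confidence(tok_pos, model_pos=None, hw_pos=None, max_dist=400):
        """Return a score in [0, 1] given character offsets within the post text."""
        score = 0.4  # base credit for finding a parseable tok/s value at all
        for pos, weight in ((model_pos, 0.35), (hw_pos, 0.25)):
            if pos is not None:
                dist = abs(tok_pos - pos)
                score += weight * max(0.0, 1.0 - dist / max_dist)
        return round(min(score, 1.0), 2)

    # A model named 20 characters away and hardware 300 characters away
    print(proximity_confidence(tok_pos=500, model_pos=480, hw_pos=800))  # about 0.8

Under a heuristic like this, the 45-75% values below would simply reflect how tightly the model and hardware names cluster around each extracted number.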

By hardware

OpenRouter

  • communityconfidence 60%

    3,000.0tok/s gpt-oss-120b on OpenRouter via hosted-api

    -120b on openrouter.. it has NINETEEN providers configured ranging from 25 TPS to 805 TPS !! That's measured, not claimed! (they \*claim\* 3000tps lol, add \*that\* to your chart.)

    source: Reddit · u/MaxPhoenix_ · 2026-02-14

  • communityconfidence 60%

    2,000.0tok/s Qwen 3 Coder on OpenRouter via hosted-api

    prised the industry hasn't been as impressed with their benchmarks on token throughput. We're using the Qwen 3 Coder 480b model and seeing ~2000 tokens/second, which is easily 10-20x faster then most LLM models on the market. Even some of the fastest models still only achieve 100…

    source: HN · u/rbitar · 2025-10-01

  • communityconfidence 55%

    330.0tok/s on OpenRouter via hosted-api AWQ

    70.2% Setup: RTX 5090, 32GB VRAM. Qwen runs at \~50 tok/s per request, median text is 87 tokens so \~1.8s/text. Aggregate throughput \~200-330 tok/s at c=16-32. Gemma 4 26B on vLLM was too slow for production, Triton problem most probably — ended up using OpenRouter for it and …

    source: Reddit · u/Interesting_Fly_6576 · 2026-04-07

Cerebras

  • communityconfidence 60%

    2,000.0tok/s Llama3.3 70B on Cerebras via hosted-api

    check [https://inference.cerebras.ai/](https://inference.cerebras.ai/) \- currently they are the only provider who provides up to 2000 tokens/s for Llama3.3 70B. LLama3 405B runs at \~1000 tokens/s - [https://cerebras.ai/blog/llama-405b-inference](https://cerebras.ai/blog/llama…

    source: Reddit · u/MLDataScientist · 2025-01-02

  • communityconfidence 60%

    2,000.0tok/s Qwen3-Coder on Cerebras via hosted-api

    er-480B <i>hosted by Cerebras</i> is $2/Mtok (both input and output) through OpenRouter.<p>OpenRouter claims Cerebras is providing at least 2000 tokens per second, which would be around 10x as fast, and the feedback I'm seeing from independent benchmarks indicates that Qwen3-Code…

    source: HN · u/coder543 · 2025-08-29

  • communityconfidence 60%

    1,500.0tok/s Qwen3-235B on Cerebras via hosted-api

    nches Qwen3-235B, achieving 1.5k tokens per second So, does that mean that in general for the most modern high end LLM tools, to generate ~1500 tokens per seconds you need around $500k in hardware?<p>Checking: Anthropic charges $70 per 1 million output tokens. @1500 tokens per s…

    source: HN · u/stingraycharles · 2025-07-23

  • communityconfidence 60%

    1,000.0tok/s gpt-oss-120b on Cerebras via hosted-api

    Cerebras Code now supports GLM 4.6 at 1000 tokens/sec > but we literally have no way to prove they're right<p>Of course we do. Just run a benchmark with Cerebras/Groq and compare to the result

    source: HN · u/threeducks · 2025-11-08

  • communityconfidence 60%

    969.0tok/s Llama 3.1 405B on Cerebras via hosted-api

    Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference - Cerebras [link: https://cerebras.ai/blog/llama-405b-inference]

    source: Reddit · u/badgerfish2021 · 2024-11-19

  • communityconfidence 60%

    808.0tok/s Qwen3 32B on Cerebras via hosted-api

    ut gaps Hardware optimization creates wild performance swings:<p>Cerebras with Qwen3 32B at 2,496 tok/s for $0.50/1M and Llama 4 Scout at 2,808 tok/s for $0.70/1M.<p>Compare to the same models elsewhere: Often stuck at 40-80 tok/s for similar or higher prices.<p>Arbitrage alert: …

    source: HN · u/demian101 · 2025-07-17

  • communityconfidence 60%

    800.0tok/s Llama 3.1-8B on Cerebras via hosted-api

    Launches the World’s Fastest AI Inference Cerebras Inference is available to users today! **Performance:** Cerebras inference delivers 1,800 tokens/sec for Llama 3.1-8B and 450 tokens/sec for Llama 3.1-70B. According to industry benchmarking firm Artificial Analysis, Cerebras I…

    source: Reddit · u/CS-fan-101 · 2024-08-27

  • communityconfidence 60%

    560.0tok/s Llama3.1-70B on Cerebras via hosted-api

    redd.it/v2186rtvyuqd1.png?width=1618&format=png&auto=webp&s=e259568d90417ddddc60800244b0dde3fb806709 Llama3.1-8B: 2,010 T/s Llama3.1-70B: 560 T/s

    source: Reddit · u/Barry_Jumps · 2024-09-25

Groq

  • communityconfidence 60%

    1,665.0tok/s Llama 3.3 70B on Groq via hosted-api

    85 tps for Llama 3.3 70B using [LPUs](https://groq.com/the-groq-lpu-explained/) Edit, Groq also mentions that they have achieved speeds of 1665 tps (!) with speculative decoding and LPUs as well ([source](https://groq.com/groq-first-generation-14nm-chip-just-got-a-6x-speed-boost…

    source: Reddit · u/TechNerd10191 · 2025-04-22

  • communityconfidence 60%

    627.0tok/s Qwen3 32B on Groq via hosted-api

    1M (103 tok/s) and Qwen3 235B at 0.56ms for $0.30/1M (50 tok/s).<p>Groq takes it further with models like Qwen3 32B at 0.14ms for $0.36/1M (627 tok/s).<p>Arbitrage alert: These blow away slower "enterprise" options costing 10x more, ideal for real-time apps<p>3. speed demons with…

    source: HN · u/demian101 · 2025-07-17

  • communityconfidence 60%

    200.0tok/s Llama 3 8B on Groq via hosted-api

    Groq surpasses 1,200 tokens/sec with Llama 3 8B When reading Hacker News you develop a signal/noise filter, where lots of headlines make bold claims but you filter them o

    source: HN · u/andy_xor_andrew · 2024-05-30

  • communityconfidence 60%

    175.0tok/s Mistral Small on Groq via hosted-api

    input and $0.39/M output. On that much memory, you can have at least 2 instances of QwQ with a lot of context that would maybe put out 150-175 t/s with concurrency which is around $5 a day. If you are using things like Mistral Small 3.1 or Gemma 3 27, it would be quite a bit more…

    source: Reddit · u/dionysio211 · 2025-04-17

RTX 5060 Ti

  • communityconfidence 55%

    1,200.0tok/s on RTX 5060 Ti IQ3_XXS

    total 32G vram) Contexto 130k, kv cache f16, llama.ccp Q5_K_XL - pp ~650 t/s, tg ~16 t/s Q4_K_M - pp ~800 t/s, tg ~19 t/s IQ3_XXS - pp ~1200 t/s, tg ~25 t/s

    source: Reddit · u/Embarrassed-Boot5193 · 2026-03-11

RTX 4090

  • communityconfidence 55%

    949.5tok/s deepseek-r1 on RTX 4090 awq

    2x 4090 - 32b - 949.5 tokens/s with 64 cocurrent reqs (aphrodite engine deepseek-r1-distill-qwen-32b-awq --kv-cache-dtype fp8)

    source: Reddit · u/Due_Car8412 · 2025-01-21

  • communityconfidence 70%

    665.2tok/s Qwen3-Coder-Next on RTX 4090 via llama.cpp

    with 256k context, llama.cpp autoparser branch, windows 11. prompt eval time = 50498.86 ms / 33594 tokens ( 1.50 ms per token, 665.24 tokens per second) eval time = 2282.98 ms / 82 tokens ( 27.84 ms per token, 35.92 tokens per second) t…

    source: Reddit · u/gtrak · 2026-02-18

  • communityconfidence 75%

    548.9tok/s Llama-3.1-70B on RTX 4090 via vllm FP8

    neuralmagic\_Meta-Llama-3.1-70B-Instruct-FP8-dynamic Avg generation throughput: \~29-30 tokens/s Avg prompt throughput: 548.9 tokens/s (4 GPUs, 4090, power limited to 325) (8x, 8x, 4x , 4x) 5950x taichi x570 vllm backend i didn't do the specific prompt to get the va

    source: Reddit · u/I_can_see_threw_time · 2024-08-01

  • communityconfidence 50%

    400.0tok/s on RTX 4090 via llama.cpp

    up**: * RTX 4090 workstation running llama.cpp * OpenCode on my MacBook * 4-bit quantized model, 64K context size, \~22GB VRAM usage * \~2,400 tok/s prefill, \~40 tok/s generation Based on my testing: * It works surprisingly well and makes correct tool calling for tasks like w…

    source: Reddit · u/garg-aayush · 2026-03-30

unknown

  • communityconfidence 45%

    787.0tok/s Mistral Small Q8_0

    * | - Size: 15.58 GiB / 26.90B params — fully on GPU, zero CPU offload - pp plateaus at 1024 (compute saturation at ~787 t/s) - Effective memory bandwidth utilization: ~304 GB/s (~51% of 600 GB/s theoretical) **vs Vulkan (same model):**

    source: Reddit · u/Icy_Gur6890 · 2026-04-09

  • communityconfidence 55%

    768.0tok/s Qwen3 Coder via llama.cpp AWQ

    oughput:** **1,218 tokens/s** # llama.cpp * **Quant:** `_unsloth/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf_` * **Throughput:** **768 tokens/s**

    source: Reddit · u/Holiday_Purpose_3166 · 2025-12-13

  • communityconfidence 45%

    711.0tok/s Qwen3-8B Q8_0

    Run with --flash-attn ``` **Dense model (Qwen3-8B Q8_0) — prompt processing:** - ROCm default, no flash attn: **711 t/s** - ROCm + flash attn only: **~3,980 t/s** - **5.5× improvement from one flag** --- ## …

    source: Reddit · u/Important_Quote_1180 · 2026-03-27

  • communityconfidence 55%

    669.0tok/s gpt-oss-20b via vllm Q3_K_M

    32B-Instruct-GGUF_Qwen2.5-VL-32B-Instruct-Q4_K_M.gguf`|18.48 GB|626 t/s|**22.95 t/s**| |`DeepSeek-R1-Distill-Qwen-32B-Q3_K_M.gguf`|14.84 GB|669 t/s|**24.12 t/s**| |`gpt-oss-20b-F16.gguf`|12.83 GB|2620 t/s|**87.09 t/s**| |`ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf`|11.27 GB…

    source: Reddit · u/reujea0 · 2026-01-01

  • communityconfidence 65%

    652.0tok/s Qwen3-30B-A3B via vllm FP8

    o 6000 blackwell with vllm Power limit set to 450w **Short Context (1K tokens):** * Single user: 88.4 tok/s * 10 concurrent users: **652 tok/s** throughput * Latency: 5.65s → 7.65s (1→10 users) **Long Context (256K tokens):** * Single user: 22.0 tok/s * 10 concurrent use…

    source: Reddit · u/notaDestroyer · 2025-10-16

  • communityconfidence 55%

    626.0tok/s DeepSeek-R1 via vllm Q4_K_M

    -2506-UD-Q5_K_XL.gguf`|15.63 GB|861 t/s|**32.20 t/s**| |`unsloth_Qwen2.5-VL-32B-Instruct-GGUF_Qwen2.5-VL-32B-Instruct-Q4_K_M.gguf`|18.48 GB|626 t/s|**22.95 t/s**| |`DeepSeek-R1-Distill-Qwen-32B-Q3_K_M.gguf`|14.84 GB|669 t/s|**24.12 t/s**| |`gpt-oss-20b-F16.gguf`|12.83 GB|2620 t/s…

    source: Reddit · u/reujea0 · 2026-01-01

  • communityconfidence 45%

    612.0tok/s Llama-3.1-8B BF16

    -----|------|--------|---------|----------|------|---------|----------|------| | Llama-3.1-8B-Instruct-BF16.gguf | 16GB | 1:1 | 2,279 t/s | 612 t/s | -73% | 12.61 t/s | **18.82 t/s** | +49% | | Mistral-Small-3.2-24B-Instruct-2506-UD-Q5_K_XL.gguf | 17GB | 1:1 | 1,658 t/s | 404 t/s…

    source: Reddit · u/reujea0 · 2026-01-10

  • communityconfidence 45%

    521.7tok/s Mixtral 8x7b Q5_K_M

    mings: load time = 28128.85 ms llama_print_timings: sample time = 492.61 ms / 257 runs ( 1.92 ms per token, 521.71 tokens per second) llama_print_timings: prompt eval time = 5323.88 ms / 61 tokens ( 87.28 ms per token, 11.46 tokens …

    source: Reddit · u/redditaccountno6 · 2024-05-04

  • communityconfidence 50%

    470.0tok/s Qwen3-32B via vllm

    I just ran 2 cards directly from my motherboard's PCIE slots, pcie4.0 at 8x and I saw PP to double. E.g. Qwen3-32B PP went from 230t/s to 470t/s.

    source: Reddit · u/MLDataScientist · 2025-07-22

  • communityconfidence 55%

    393.6tok/s Deepseek V3 via vllm Q4

    (120k tokens): • Successful Requests: 1000 • Benchmark Duration: 439.96 seconds • Request Throughput: 2.27 req/s • Output Token Throughput: 393.61 tok/s • Total Token Throughput: 921.91 tok/s • Mean TTFT: 23549.66 ms • Mean TPOT: 700.44 ms Full benchmarks are available here: Sub…

    source: Reddit · u/Ok-Perception2973 · 2025-01-06

  • communityconfidence 50%

    360.0tok/s gpt-oss-120b via llama.cpp

    on large contexts. Not as bad on ROCm compared to Vulkan, but still significant, like prompt processing goes from \~900 t/s on 0 context to 360 t/s on 32K context for gpt-oss-120b (that was before recent performance degradation). For comparison, my DGX Spark suffers a bit less fr…

    source: Reddit · u/Eugr · 2025-12-15

  • communityconfidence 55%

    212.0tok/s Llama-3.2-1B via llama.cpp Int4

    SNR (44.8 → 19.7 dB). Dead end. * Cascade offload: ruled out by AMD docs. * Speculative decoding with Llama-3.2-1B draft (44% accept rate, 212 t/s draft): **zero effective gain**. Spec decoding not helping is the interesting one, normally a 44% accept rate would buy you somethin…

    source: Reddit · u/brandedtamarasu · 2026-03-26

  • communityconfidence 55%

    184.6tok/s phi4 via ollama fp16

    s|98.90 tokens/s| |fp16|55.584083ms|48.72 tokens/s| **gemma3:1b** |Quantization|Load Duration|Inference Speed| |:-|:-|:-| |q4|55.116083ms|184.62 tokens/s| |fp16|55.034792ms|135.31 tokens/s| **phi4:14b** |Quantization|Load Duration|Inference Speed| |:-|:-|:-| |q4|25.423792ms|3…

    source: Reddit · u/purealgo · 2025-03-12

RTX 5090

  • communityconfidence 55%

    765.5tok/s DeepSeek V3 on RTX 5090 IQ4_XS

    X8 5.0, 4090s, A6000, A40 at X4 4.0 and 3090 at X1 3.0: 665.46 t/s PP, 25.90 t/s TG * 5090s at X8/X8 5.0, 4090s, and Ampere trio at X4 4.0: 765.51 t/s PP, 26.18 t/s TG. * 5090(1) at X16 5.0, 5090(2) at X4 5.0, all the rest at X4 4.0: 940 t/s PP, 26.75 t/s TG. * 5090s at X16 5.0, …

    source: Reddit · u/panchovix · 2026-01-16

  • communityconfidence 55%

    630.0tok/s on RTX 5090 via vllm BF16

    nchmarks & 3 Critical Fixes for Reasoning Models Benchmarks (BF16, no quantization): \- Single: \~83 tok/s \- Batched (10 concurrent): \~630 tok/s \- TTFT: 45–60ms \- VRAM: 30.6 / 32 GB Things that bit me: \- The HuggingFace reasoning parser plugin has broken imports on vL…

    source: Reddit · u/Impressive_Tower_550 · 2026-03-15

  • communityconfidence 75%

    555.8tok/s Qwen3-Coder on RTX 5090 via sglang AWQ

    atency? # 1. Choosing the Framework **RTX 5090 — Qwen3-Coder-30B-A3B-Instruct-AWQ** |Metric|vLLM|SGLang| |:-|:-|:-| |Output throughput|**555.82 tok/s**|207.93 tok/s| |Mean TTFT|**549 ms**|1,558 ms| |Median TPOT|**7.06 ms**|18.84 ms| vLLM wins by 2.7x. SGLang is required `--qu…

    source: Reddit · u/NoVibeCoding · 2026-03-06

  • communityconfidence 60%

    400.0tok/s GPT-OSS-120b on RTX 5090 via llama.cpp

    wer, and use llama.cpp to run those same models the GPT-OSS-120b model has to offload 24 layers the the CPU using --n-cpu-moe 24, and gets ~400 t/s pp and 21 t/s generation. Which is not bad considering the load I am putting on a consumer grade memory system. GPT-OSS-20b is an…

    source: Reddit · u/unrulywind · 2025-08-23

  • communityconfidence 60%

    400.0tok/s gpt-oss-20b on RTX 5090 via vllm

    UDA\_ARCH=120a / Tried VLLM also * Model: gpt-oss-20b MXFP4 (GGUF) / Safetensors **The Problem:** * First 1-2 inferences work fine (\~300-400 tok/s) * On the 3rd or 4th inference, GPU hangs with "CUDA error: illegal memory access" * GPU enters error state, fans spin to 100% * n…

    source: Reddit · u/Think_Illustrator188 · 2025-12-28

H200

  • communityconfidence 75%

    712.0tok/s Qwen2.5-32B on H200 via vllm FP16

    p dive on vLLM quantization. Tested AWQ, GPTQ, Marlin, GGUF, and BitsandBytes on Qwen2.5-32B using an H200. Stuff we found: * Marlin hits 712 tok/s, baseline FP16 does 461. Quantized and faster. * GPTQ without Marlin kernel is actually slower than FP16 (276 tok/s) * BitsandByte…

    source: Reddit · u/LayerHot · 2026-01-09

  • communityconfidence 50%

    398.0tok/s on H200 via vllm

    set. \- Single user: 207 tok/s, 35ms TTFT \- At 32 concurrent users: 2,267 tok/s, 85ms TTFT \- Peak throughput (no concurrency limit): 4,398 tok/s All of the benchmarks were done with [vLLM benchmark CLI](https://docs.vllm.ai/en/latest/benchmarking/cli/) Full numbers: …

    source: Reddit · u/LayerHot · 2026-01-20

A100

  • communityconfidence 75%

    700.0tok/s LLaMA-3 70B on A100 via vllm FP16

    Decoding speed is nuts—up to 1.8x more requests/sec than vLLM on an A100. TurboMind pushes 4-bit inference 2.4x faster than FP16, hitting \~700 tokens/sec at 100 users (LLaMA-3 70B Q4). Low latency (40-60ms), easy one-command server setup, and it even handles multi-round chats ef…

    source: Reddit · u/juliensalinas · 2025-04-08

RTX 3090

  • communityconfidence 60%

    700.0tok/s Qwen3 30B-A3B on RTX 3090 via sglang

    it. I was also really excited nearly a year ago for Qwen3 30B-A3B, and I had gotten it running quite fast on my 3090s (150tok/s single and 700tok/s batched per 3090, though i hadn't tested long context) and then I abjectly failed to come up with a use case for it, I acquired a 50…

    source: Reddit · u/michaelsoft__binbows · 2026-02-28

  • communityconfidence 45%

    693.2tok/s on RTX 3090 fp8

    90s Speed : 79.4 tok/s \--- Concurrent Generation (n=20) --- Total tokens : 20480 Wall time : 29.54s Throughput : 693.2 tok/s (aggregate) \--- Prefill / TTFT (target \~8000 input tokens) --- Input : 15381 tokens (from server) TTFT : 7053 …

    source: Reddit · u/Conscious_Cut_6144 · 2026-03-11

  • communityconfidence 70%

    600.0tok/s Llama 3.3 70b on RTX 3090 via exllamav2

    I run 2*3090 on ExLlamaV2 for Llama 3.3 70b at 4.5bpw with 32k context with tensor parallel at 600tok/s prompt ingestion and 30tok/s for generation, all for $1.5k thanks to ebay deals. Heck you can speed things even more with 4.0bpw + speculat

    source: Reddit · u/TyraVex · 2025-02-16

  • communityconfidence 60%

    399.1tok/s deepseek-r1 on RTX 3090 via ollama

    ----------------------------------- Model: deepseek-r1:32b Performance Metrics: Prompt Processing: 399.05 tokens/sec Generation Speed: 27.18 tokens/sec Combined Speed: 27.58 tokens/sec …

    source: Reddit · u/ChemicalExcellent463 · 2025-02-17

  • communityconfidence 60%

    187.0tok/s Qwen3-Coder-Next on RTX 3090 via ollama

    Next UD-Q4\_K\_XS via llama-server (Qwen3.5 under test as I type this) \- Layer split across both GPUs (PHB interconnect, no NVLink) \- \~187 tok/s prompt processing, \~81 tok/s generation The agent talks to any OpenAI-compatible endpoint, so it works with llama-server, Ollama…

    source: Reddit · u/fabiononato · 2026-03-07

B200

  • communityconfidence 55%

    651.7tok/s Qwen3-Coder on B200 FP8

    he largest gap on the most communication-heavy workload – **GLM-4.6-FP8 (8-way TP):** B200 is **4.87x** faster than PRO 6000 (8,036.71 vs 1,651.67 tok/s) – **Qwen3-Coder-480B (4-way TP):** B200 is **4.02x** faster than PRO 6000 (6,438.43 vs 1,602.96 tok/s) – **GLM-4.5-Air (single…

    source: Reddit · u/NoVibeCoding · 2026-02-11

SambaNova

  • communityconfidence 60%

    600.0tok/s Qwen 2.5 32b on SambaNova via hosted-api

    aNova, Cerebras and Grok have hardware that may be able to run it. Outside of that, don't think so. SambaNova has Qwen 2.5 32b Coder at 400-600 t/s.

    source: Reddit · u/FullOf_Bad_Ideas · 2025-01-02

Together AI

  • communityconfidence 60%

    588.0tok/s GPT-OSS-120B on Together AI via hosted-api

    fected regimes. We wrote up the details at cumulus.blog/ionattention.<p>On multimodal pipelines we get better performance than big players (588 tok/s vs. Together AI's 298 on the same VLM workload). We're honest that p50 latency is currently worse (~1.46s vs. 0.74s) — that's the …

    source: HN · u/vshah1016 · 2026-03-12

  • communityconfidence 70%

    200.0tok/s Gemma3-27b on Together AI via hosted-api

    s rocking for a laptop. 5k in 32 sec. is about 150 t/sec. I got an AMD guy to run some numbers on their new unified chip and he was showing 200 t/sec, but with a smaller 7b model. A larger model would have surely slowed him down. I run Gemma3-27b in IQ4-XS on an RTX 4770 ti a…

    source: Reddit · u/unrulywind · 2025-04-11

RTX 3090 Ti

  • communityconfidence 60%

    500.0tok/s Llama 3 8B on RTX 3090 Ti via vllm

    10 mins I can go through 24.9M prompt tokens and 1.69M generated tokens on 2x rtx 3090 Ti and Llama 3 8B W8A8 in vLLM 0.8.3 V1, so about 41,500 t/s input and 2800 t/s output, both at the same time. I went through like 8B+ tokens on single card this way before I put in the second …

    source: Reddit · u/FullOf_Bad_Ideas · 2025-04-16

H100

  • communityconfidence 60%

    400.0tok/s Llama 3.1 8B on H100 via sglang

    raining to get EAGLE draft models - if you're optimizing a production workload it will probably be worth it tough - the EAGLE team reported 400 TPS for a Llama 3.1 8B model on a single H100, that's bonkers: [https://x.com/hongyangzh/status/1903109123895341536](https://x.com/hongy…

    source: Reddit · u/randomfoo2 · 2025-03-25

  • communityconfidence 60%

    170.0tok/s Mistral 7B on H100 via llama.cpp

    /www.baseten.co/blog/benchmarking-fast-mistral-7b-inf...</a> benchmarks an H100 on Mistral 7B, and claims different providers range from 38-170 TPS depending on their proprietary configurations.<p>A used H100 server is roughly $80k according to google. H100 power draw alone is 70…

    source: HN · u/vessenes · 2024-05-20

MI300X

  • communityconfidence 45%

    370.0tok/s on MI300X FP16

    on) vs H100 * Bandwidth: 5.2 TB/s (faster than your desk llama can spit) H100: * Price: $28,000 (approximately one kidney) * Performance: 370 tokens/s/GPU (FP16), but it doesn't fit into one. * Memory: 80GB (MI300X has almost 2.5 times more VRAM!!) Key Points: 1. H100 is \~4.…

    source: Reddit · u/No_Training9444 · 2024-06-25

  • communityconfidence 55%

    353.0tok/s llama-2-70b on MI300X FP16

    85c28e456902a874152d145e2ea3f4d28f84ebf8 After some calculations: MI300X: * Price: $15,000 (or 1.5 million alpaca tokens) * Performance: 353 tokens/s/GPU (FP16) * Memory: 192GB HBM3 (that's a lot of context for your LLM to chew on) vs H100 * Bandwidth: 5.2 TB/s (faster than yo…

    source: Reddit · u/No_Training9444 · 2024-06-25

RTX 5060

  • communityconfidence 55%

    292.0tok/s on RTX 5060 via vllm FP8

    zation-goal balanced # Output tells you: # - Recommended precision: FP8 # - Batch size: 128 # - Expected throughput: 6,292 tokens/sec # - Memory usage: 15.2GB / 24GB # - Plus full vLLM config you can copy-paste **Validation:** Tested on my RTX …

    source: Reddit · u/1Hesham · 2025-11-18

Replicate

  • communityconfidence 60%

    264.0tok/s Qwen3 30B-A3B on Replicate via hosted-api

    it's now 605 t/s. Similarly, Llama 4 Scout has a pp512 of 103 t/s, and is now 173 t/s, although the HIP backend is significantly faster at 264 t/s. Unlike last time, I won't be taking any model testing requests as these sweeps take quite a while to run - I feel like there are en…

    source: Reddit · u/randomfoo2 · 2025-07-22

Radeon RX 9070 XT

  • communityconfidence 55%

    200.0tok/s on Radeon RX 9070 XT via llama.cpp FP16

    ~778 TFLOPS FP4 Compute, and I had 41 TFLOPS FP16 Compute, running Qwen 3.5 27b at similar quantizations (Q4 something vs. IQ3-XXS) he had ~200 TPS PP and ~20 TPS TG to memory, which was near identical to what I get on llama.cpp Vulkan inference of Qwen 3.5 27b IQ3-XXS. Intuiti…

    source: Reddit · u/Qwen30bEnjoyer · 2026-04-02

  • communityconfidence 60%

    180.0tok/s gpt oss 20b on Radeon RX 9070 XT via llama.cpp

    hanics) gpt OSs 20v won by a metric mile. It did run faster than gpt OSs though on llama.cpp (9070xt) with gpt OSs at 140 tok/s and lfm at 180 tok/s

    source: Reddit · u/ConversationOver9445 · 2026-02-25

M4

  • communityconfidence 60%

    180.0tok/s Qwen3-32B on M4 via lm-studio

    its 79.51% while sustaining ~64 tok/s - arguably today's best speed/quality trade-off for Mac setups. 5️⃣ The 0.6B micro-model races above 180 tok/s but tops out at 37.56% - that's why it's not even on the graph (50 % performance cut-off). All local runs were done with @lmstudi…

    source: Reddit · u/ResearchCrafty1804 · 2025-05-07

How this list is built

A daily crawler queries r/LocalLLaMA, r/Oobabooga, r/Ollama, r/MachineLearning, and HN for posts that mention tok/s figures alongside GPU or Apple-silicon hardware. A regex extractor finds every (model × hardware × tok/s) tuple by nearest-neighbor matching within a paragraph; sanity filters then drop obvious prompt-eval numbers, hardware-class outliers, and low-confidence parses.
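As a rough sketch of that nearest-neighbor matching, assuming simplified patterns: the regexes, the prompt-eval ("pp") filter, and the sample text below are illustrative, not the pipeline's real code.

    import re

    # Rough sketch: pair each tok/s value with the nearest model and hardware
    # mention in the same paragraph. Patterns and filters are assumptions.

    TOKS  = re.compile(r"(\d[\d,]*\.?\d*)\s*(?:tok(?:ens)?/s(?:ec)?|t/s|tps)", re.I)
    MODEL = re.compile(r"\b(llama[\w.\- ]*|qwen[\w.\- ]*|gpt-oss[\w\-]*|mistral[\w.\- ]*)", re.I)
    HW    = re.compile(r"\b(rtx \d{4}\s?(?:ti)?|h100|h200|a100|b200|mi300x)\b", re.I)

    def extract_claims(paragraph):
        """Yield {model, hardware, tok_s} dicts from one paragraph of a post."""
        models   = [(m.start(), m.group(1)) for m in MODEL.finditer(paragraph)]
        hardware = [(m.start(), m.group(1)) for m in HW.finditer(paragraph)]
        for m in TOKS.finditer(paragraph):
            # crude sanity filter: skip values that look like prompt-processing speeds
            if "pp" in paragraph[max(0, m.start() - 12):m.start()].lower():
                continue
            def nearest(cands):
                return min(cands, key=lambda c: abs(c[0] - m.start()))[1] if cands else None
            yield {"model": nearest(models),
                   "hardware": nearest(hardware),
                   "tok_s": float(m.group(1).replace(",", ""))}

    text = "2x RTX 4090, Qwen3 32B AWQ: pp ~800 t/s, tg ~19 t/s"
    print(list(extract_claims(text)))
    # [{'model': 'Qwen3 32B AWQ', 'hardware': 'RTX 4090', 'tok_s': 19.0}]

Real posts are messier, with multi-model threads and benchmark tables, which is where the mis-attributions the trust note warns about come from.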

Folklore is shown alongside signed runs on model and hardware pages but never replaces them. If you have a signed run for a config surfaced here, your number wins. Submit one with pipx install llm-speed && llm-speed bench.

Total community claims indexed: 2176.