wrapped·Jul 02, 2026
/r/r_iu2sfa9ykvw
your run
gpt-oss
on RTX 4090
142
tok/s decode
RTX 4090 (24GB) + AMD EPYC 7352 24-Core Processor (24c) + 252GB
rank in tier
4/7
RTX 4090 runs
top 57%
best workload
chat-long
where the rig flew
slowest workload
—
single-workload run
backend
ollama
ollama runs llama.cpp under the hood; plain llama.cpp tends to be a few percent faster.
faster than
- phi-4 on RTX 5090: 141 tok/s
- Qwen2.5-Coder-14B-Instruct on RTX 5090: 140 tok/s
- Qwen2.5-7B-Instruct-4bit on M3 Ultra: 140 tok/s