Yi-Coder-9B-Chat-4bit

6 workload results across 2 hardware configurations.

Fastest local config

199.3 decode tok/s

Local runs (6 runs)

Runs are submitted from contributors' own machines via MLX, llama.cpp, vLLM, exllamav2, or ollama, and each run is signed on the submitter's hardware.

RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB

Workload	Backend	Quant	decode tok/s	prefill tok/s	TTFT	Run
chat-short	llama.cpp	-	66.79 tok/s	no data	317 ms	r_zlh6az5q0o_
chat-long	llama.cpp	-	69.60 tok/s	no data	1,195 ms	r_zlh6az5q0o_
concurrent-decode	llama.cpp	-	70.99 tok/s	no data	no data	r_zlh6az5q0o_
agent-trace	llama.cpp	-	71.76 tok/s	4,478.0 tok/s	397 ms	r_zlh6az5q0o_
chat-short	llama.cpp	-	199.3 tok/s	no data	36.2 ms	r_u4iojm6-ekg

Workload	Backend	Quant	decode tok/s	prefill tok/s	TTFT	Run
chat-short	mlx@0.31.3	-	103.5 tok/s	390.9 tok/s	307 ms	r_3hvui9a1yuc
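
To compare the chat-short rows above, TTFT and decode throughput can be combined into a rough per-response latency estimate. The sketch below is illustrative only: the function name `estimated_response_seconds` and the 256-token response length are assumptions, not part of the benchmark, and the formula assumes the first token arrives at TTFT and the remaining tokens stream at a constant decode rate.

```python
def estimated_response_seconds(ttft_ms: float, decode_tok_per_s: float, output_tokens: int) -> float:
    """Rough wall-clock time for one response (hypothetical helper).

    Assumes the first token arrives at TTFT and every later token
    is produced at the steady decode rate reported in the table.
    """
    if decode_tok_per_s <= 0:
        raise ValueError("decode_tok_per_s must be positive")
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    remaining_tokens = output_tokens - 1
    return ttft_ms / 1000.0 + remaining_tokens / decode_tok_per_s


if __name__ == "__main__":
    # (run id, decode tok/s, TTFT in ms) copied from the chat-short rows above.
    chat_short_rows = [
        ("r_zlh6az5q0o_", 66.79, 317.0),
        ("r_u4iojm6-ekg", 199.3, 36.2),
        ("r_3hvui9a1yuc", 103.5, 307.0),
    ]
    assumed_output_tokens = 256  # assumption: not a value from the benchmark
    for run_id, decode_rate, ttft_ms in chat_short_rows:
        seconds = estimated_response_seconds(ttft_ms, decode_rate, assumed_output_tokens)
        print(f"{run_id}: ~{seconds:.2f} s for {assumed_output_tokens} tokens")
```

For short replies TTFT dominates the estimate, while for long replies the decode rate does, so the ranking of two runs can depend on the response length chosen.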