TokenArena — benchmark for AI inference endpoints at unit cost
New arXiv paper introduces TokenArena, a continuous benchmark measuring AI inference not at the model level, but at the endpoint: the specific (provider, model, SKU) tuple with its quantization, decoding strategy, region, and serving stack.
Synthesizes five core metrics—output speed, time to first token, workload-blended price, effective context, quality—plus modeled energy into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity.
Shifts inference benchmarking from lab to real-world deployment unit, where actual cost and energy trade-offs are decided.