From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline

Building an affordable and reliable benchmark for LLM chatbots has become a critical challenge. A high-quality benchmark should 1) robustly separate models by capability, 2) reflect human preference in real-world use cases, and 3) update frequently to avoid over-fitting and test-set leakage.

We introduce Arena-Hard, a data pipeline that builds high-quality benchmarks from live data in Chatbot Arena, a popular crowd-sourced platform for LLM evaluation. To measure benchmark quality, we propose two key metrics (a sketch of how they can be computed follows the list):

  1. Agreement with human preference: whether the benchmark score agrees strongly with human preference as measured by Chatbot Arena.
  2. Separability: whether the benchmark can confidently separate models.
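
As a rough illustration, here is a minimal sketch of one way these two metrics could be computed, assuming each model has a list of per-question benchmark scores and a reference Chatbot Arena rating. The function names (`bootstrap_ci`, `separability`, `agreement`) and the toy numbers are illustrative, not the Arena-Hard implementation: separability is taken as the fraction of model pairs whose bootstrapped confidence intervals do not overlap, and agreement as the fraction of model pairs ranked in the same order by the benchmark and by the Arena.

```python
import itertools
import numpy as np

def bootstrap_ci(scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a model's mean benchmark score."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    means = [rng.choice(scores, size=len(scores), replace=True).mean() for _ in range(n_boot)]
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

def separability(benchmark_scores):
    """Fraction of model pairs whose bootstrap CIs do not overlap (i.e., confidently separated)."""
    cis = {m: bootstrap_ci(s) for m, s in benchmark_scores.items()}
    pairs = list(itertools.combinations(cis, 2))
    separated = sum(1 for a, b in pairs if cis[a][1] < cis[b][0] or cis[b][1] < cis[a][0])
    return separated / len(pairs)

def agreement(benchmark_scores, arena_ratings):
    """Fraction of model pairs ranked in the same order by the benchmark and by Chatbot Arena."""
    bench_mean = {m: np.mean(s) for m, s in benchmark_scores.items()}
    pairs = list(itertools.combinations(bench_mean, 2))
    same_order = sum(
        1 for a, b in pairs
        if (bench_mean[a] - bench_mean[b]) * (arena_ratings[a] - arena_ratings[b]) > 0
    )
    return same_order / len(pairs)

# Toy usage with made-up scores and ratings:
scores = {
    "model_a": [0.90, 0.80, 0.85, 0.95],
    "model_b": [0.60, 0.55, 0.65, 0.50],
    "model_c": [0.58, 0.62, 0.60, 0.57],
}
elo = {"model_a": 1250, "model_b": 1100, "model_c": 1095}
print(f"separability: {separability(scores):.2f}")
print(f"agreement:    {agreement(scores, elo):.2f}")
```

In this toy example, models b and c have overlapping confidence intervals, so the benchmark separates only some pairs confidently even though it ranks all pairs in the same order as the Arena ratings.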

Arena-Hard-v0.1 achieves the highest separability (87.4%) among widely adopted LLM benchmarks and the highest agreement (89.1%) with Chatbot Arena. It is also cheap and fast to run ($25).


Contributors

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica

Publications