Dissertation Talk: Towards Robust and Scalable Evaluation of Large Language Models – Wei-Lin Chiang

Title: Towards Robust and Scalable Evaluation of Large Language Models
Speaker: Wei-Lin Chiang
Advisor: Ion Stoica

Date: Friday, December 6, 2024
Time: 11 AM – 12 PM PT

Location: Soda 465H

Abstract
The rapid advancement of Large Language Models (LLMs), driven by scaling laws and substantial investments, has unlocked remarkable capabilities. Yet, effectively evaluating these generalist AI systems presents significant challenges, including their broad functionality, concerns over benchmark contamination, and the complexity of aligning with nuanced human preferences.

This dissertation presents robust and scalable evaluation systems to address these challenges. Central to this effort is Chatbot Arena, a live, crowdsourced platform that leverages human preferences to assess and compare LLMs, drawing on millions of interactions worldwide to offer insights beyond traditional static benchmarks. By capturing a diversity of perspectives, this approach provides a deeper understanding of model performance across evolving, real-world use cases. Additionally, we examine LLM judges as automated systems for constructing high-quality, validated benchmarks at scale, offering a cost-effective alternative to manual human evaluation. Together, these contributions advance the scalability and robustness of AI evaluation towards more trustworthy and human-aligned AI systems.
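
To make the leaderboard idea concrete: the Chatbot Arena work describes aggregating crowdsourced pairwise preference votes into model rankings with a Bradley-Terry-style model. The sketch below is only an illustration of that general technique, not the speaker's actual pipeline; the vote data and model names are invented, and it uses the classic Zermelo/MM iteration rather than the regression-based fit used in practice.

```python
from collections import defaultdict

def bradley_terry(battles, iters=100):
    """Fit Bradley-Terry strengths from pairwise outcomes.

    battles: list of (winner, loser) pairs.
    Returns a dict mapping model name -> normalized strength (higher is better).
    Uses the classic minorization-maximization (Zermelo) iteration.
    """
    models = {m for pair in battles for m in pair}
    wins = defaultdict(float)    # total wins per model
    games = defaultdict(float)   # number of battles between each unordered pair
    for winner, loser in battles:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0

    p = {m: 1.0 for m in models}  # uniform initial strengths
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = 0.0
            for j in models:
                if i == j:
                    continue
                n_ij = games.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new_p[i] = wins[i] / denom if denom else p[i]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}  # renormalize each step
    return p

# Illustrative votes only (placeholder model names, not real results).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),
    ("model_a", "model_b"),
]
ratings = bradley_terry(votes)
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Under this (hypothetical) vote set, the fitted strengths rank model_a first, reflecting its higher win rate; the real system additionally handles ties, confidence intervals, and anonymized, deduplicated votes at much larger scale.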