Dissertation Talk: Understanding and Enhancing LLM Inference Efficiency with Speculative Decoding – Lily (Xiaoxuan) Liu

Speaker: Lily (Xiaoxuan) Liu

Advisors: Alvin Cheung, Ion Stoica

Date: Friday, May 16, 2025

Time: 10:00am – 11:00am PT

Location: Soda Hall 510

Abstract:
Speculative decoding is a key technique for reducing inference latency in large language model (LLM) serving. This work explores several strategies to enhance its practical performance. First, we propose methods for improving speculation accuracy through online updates and highlight the potential of routing requests to customized draft models. We then address system-level challenges in deploying speculative decoding within production-grade infrastructures, focusing on its integration into vLLM. To this end, we introduce Dynamic Speculative Decoding (DSD), a framework that dynamically adapts to workload characteristics to optimize system efficiency. Finally, we present benchmarks evaluating different speculative decoding techniques in vLLM, demonstrating their speedups across diverse workloads. We analyze the gap between the theoretical and observed speedups and identify opportunities for further optimization. Collectively, these contributions advance the practicality and efficiency of speculative decoding in real-world deployments.
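For readers unfamiliar with the base technique, below is a minimal sketch of the standard draft-then-verify loop that the work above builds on (the acceptance rule of Leviathan et al., 2023, as used in systems like vLLM). The toy draft_dist/target_dist functions, the vocabulary size, and the speculation length K are illustrative placeholders, not the talk's actual models or implementation.

```python
# Minimal sketch of speculative decoding: a small draft model proposes K
# tokens, and the large target model verifies them, accepting each token
# with probability min(1, p/q). This keeps the output distribution exactly
# equal to the target model's. Toy distributions stand in for real models.

import numpy as np

VOCAB = 8   # toy vocabulary size (placeholder)
K = 4       # tokens speculated per step (placeholder)

rng = np.random.default_rng(0)

def target_dist(prefix):
    """Placeholder for the large target model's next-token distribution."""
    logits = np.sin(np.arange(VOCAB) + len(prefix))
    p = np.exp(logits)
    return p / p.sum()

def draft_dist(prefix):
    """Placeholder for the small draft model: a blurred target."""
    q = 0.7 * target_dist(prefix) + 0.3 / VOCAB
    return q / q.sum()

def speculative_step(prefix):
    """Draft K tokens, verify against the target, return accepted tokens.

    A real system batches the K+1 target evaluations into one forward
    pass; here we loop for clarity."""
    # 1) Draft phase: sample K tokens autoregressively from the draft model.
    drafted, q_dists, ctx = [], [], list(prefix)
    for _ in range(K):
        q = draft_dist(ctx)
        t = rng.choice(VOCAB, p=q)
        drafted.append(t)
        q_dists.append(q)
        ctx.append(t)

    # 2) Verify phase: accept token t with probability min(1, p[t] / q[t]).
    accepted, ctx = [], list(prefix)
    for t, q in zip(drafted, q_dists):
        p = target_dist(ctx)
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)
            ctx.append(t)
        else:
            # Rejected: resample from the residual max(0, p - q), which
            # corrects the distribution back to the target's exactly.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(VOCAB, p=residual))
            return accepted
    # All K drafts accepted: the target's next distribution yields a bonus token.
    accepted.append(rng.choice(VOCAB, p=target_dist(ctx)))
    return accepted

tokens = [0]
for _ in range(5):
    tokens += speculative_step(tokens)
print(tokens)
```

The per-step speedup depends on how many drafted tokens the target accepts, which is why the speculation accuracy and workload-adaptive choices of K discussed in the talk matter in practice.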