Speaker: Zhihao Jia
Location: Soda 510
Date: December 8, 2023
Time: 11am-12pm PST
The high computational and memory requirements of large language models (LLMs) make it challenging to train and serve them cheaply and efficiently. For example, serving LLaMA-2-70B on NVIDIA A100 GPUs can utilize as little as 2% of the available compute resources. In this talk, I will present two systems for enabling fast, efficient, and cheap LLM computation. First, SpecInfer is a low-latency LLM serving system that accelerates autoregressive LLM inference with tree-based speculative inference and verification. A key insight behind SpecInfer is to combine multiple collectively boost-tuned small speculative models to jointly predict an LLM's outputs, and to verify their correctness against the LLM using a tree-based parallel decoding mechanism. Compared to existing LLM serving systems, SpecInfer reduces the number of LLM decoding steps by 4.4x and the end-to-end inference latency by 2.4x, while preserving LLMs' generative quality.
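The core idea of speculate-then-verify decoding can be illustrated with a toy sketch. Note this is not SpecInfer's actual code: SpecInfer verifies a *tree* of candidates from multiple boost-tuned draft models in a single LLM pass, while the sketch below shows the simpler single-sequence case, with hypothetical stand-in "models" (simple arithmetic functions) so it is self-contained and runnable.

```python
# Toy sketch of speculative decoding with greedy verification.
# All names and both "models" are hypothetical stand-ins.

def draft_model(prefix, k=4):
    # Hypothetical cheap draft model: guesses the next k tokens.
    # We deliberately corrupt the third guess to exercise rejection.
    guesses = [prefix[-1] + i + 1 for i in range(k)]
    if k >= 3:
        guesses[2] += 1  # a wrong speculation
    return guesses

def target_model_next(prefix):
    # Hypothetical large model: the authoritative next token.
    return prefix[-1] + 1

def speculative_step(prefix, k=4):
    """One decoding step: draft k tokens, then verify against the target.

    Accepted drafts are emitted for free; the first mismatch is replaced
    by the target model's own token, so at least one token of guaranteed
    progress is made per step and the output is always exactly what the
    target model alone would have produced."""
    drafts = draft_model(prefix, k)
    accepted = []
    cur = list(prefix)
    for t in drafts:
        expected = target_model_next(cur)  # in practice: one batched LLM pass
        if t == expected:
            accepted.append(t)
            cur.append(t)
        else:
            accepted.append(expected)      # correct the first mismatch, stop
            break
    else:
        accepted.append(target_model_next(cur))  # bonus token on full accept
    return accepted

tokens = [0]
while len(tokens) < 12:
    tokens.extend(speculative_step(tokens))
print(tokens)  # → [0, 1, 2, ..., 12]: identical to pure target-model decoding
```

Even though the draft model is wrong every third token, the verified output is unchanged; the speedup comes from emitting several accepted tokens per expensive target-model pass instead of one.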
Second, Collie is a low-cost co-serving system for LLM inference and parameter-efficient fine-tuning (PEFT). Based on the observation that LLM inference and PEFT are complementary workloads, Collie uses a PEFT-as-a-service approach to unify the serving interface of inference and fine-tuning jobs, and jointly serves these two types of jobs to maximize GPU utilization. Compared to existing approaches that use separate systems to serve and fine-tune LLMs, Collie's co-serving design requires fewer GPUs, improves their utilization, and achieves higher fine-tuning throughput, while preserving the SLA for inference jobs.
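The complementary-workload insight can be sketched as a simple scheduling policy: latency-sensitive inference requests claim GPU capacity first each tick, and PEFT fine-tuning micro-batches backfill whatever is left so the accelerator stays busy. This is a minimal illustration under assumed names and a fixed per-tick slot count, not Collie's actual scheduler.

```python
# Toy sketch of co-serving: inference first, PEFT backfill.
# GPU_SLOTS_PER_TICK and all job names are hypothetical.

from collections import deque

GPU_SLOTS_PER_TICK = 4  # assumed per-tick batch capacity

def schedule_tick(inference_q, finetune_q):
    """Build one tick's batch: drain inference jobs first (to protect
    their SLA), then fill remaining slots with fine-tuning micro-batches
    so no GPU capacity is wasted."""
    batch = []
    while inference_q and len(batch) < GPU_SLOTS_PER_TICK:
        batch.append(inference_q.popleft())
    while finetune_q and len(batch) < GPU_SLOTS_PER_TICK:
        batch.append(finetune_q.popleft())
    return batch

inference_q = deque(["infer-1", "infer-2"])
finetune_q = deque(["peft-1", "peft-2", "peft-3"])
print(schedule_tick(inference_q, finetune_q))
# → ['infer-1', 'infer-2', 'peft-1', 'peft-2']
```

When inference load spikes, fine-tuning work is naturally displaced; when it ebbs, fine-tuning throughput rises, which is how a single shared pool can beat two separately provisioned ones.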
Zhihao Jia is an assistant professor in the Computer Science Department at Carnegie Mellon University. His research interests lie at the intersection of computer systems and machine learning, with a focus on building efficient, scalable, and performant systems for ML applications. Prof. Jia received his PhD and MS from Stanford University, where he also received the Arthur Samuel Best Doctoral Thesis Award. He is the recipient of an NSF CAREER award and research awards from Amazon, Cisco, Google, Meta, Oracle, Qualcomm, and Samsung.