Sky Seminar: Yiying Zhang (UCSD) – LLM Serving Beyond Chats
Speaker: Yiying Zhang
Location: Soda 510
Time: 12pm-1pm PDT

Title: LLM Serving Beyond Chats

Abstract:
Today’s large language model (LLM) usage goes beyond simple consumer-facing chat and Q&A in several ways. For example, LLMs are often augmented with non-LLM models, tools, function calls, data sources, and human or virtual-environment interactions. Complex tasks are often decomposed into chains or trees of LLM calls to perform multi-step reasoning. At the same time, prompts are becoming longer and more structured, resulting in repeated prompt text across requests. Unfortunately, today’s LLM serving systems were not designed for these newer usage models. My lab has been building a holistic platform to support these evolved LLM use cases, aiming to improve serving speed and reduce cost.

Specifically, I will discuss two recent systems from my lab: InferCept and Preble. InferCept [ICML’24] is an inference system for LLMs augmented with non-LLM components. It dynamically chooses and optimizes strategies for managing computed context state when an LLM’s output generation is intercepted by an augmenting entity, achieving up to a 12x performance improvement over state-of-the-art LLM inference systems. Preble [arXiv’24] is an LLM serving system that targets long prompts. In a study of five types of LLM workloads and real LLM chat histories, we found that input sequences are significantly longer than outputs and that many parts of prompts are reused across requests. By scheduling prompts efficiently and reusing computed intermediate results, Preble achieves up to a 14.5x performance improvement over state-of-the-art LLM serving systems.
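To make the prompt-reuse idea concrete, here is a toy sketch of prefix sharing: cached computation is keyed by a token prefix in a trie, so a request whose prompt shares a prefix with an earlier one only needs to recompute its unseen suffix. This is purely illustrative under assumed names (`PrefixCache`, `admit`), not Preble’s actual scheduler or data structures.

```python
# Toy prefix cache: a trie over tokens stands in for cached intermediate
# state (e.g., attention KV entries). Illustrative only, not Preble's code.

class PrefixCache:
    def __init__(self):
        self.root = {}  # trie node: token -> child trie node

    def admit(self, tokens):
        """Insert a prompt; return how many leading tokens were already cached."""
        node, hit = self.root, 0
        for i, tok in enumerate(tokens):
            if tok in node:
                hit += 1            # this prefix token's state is reusable
                node = node[tok]
            else:
                # Remaining suffix is new: insert it into the trie.
                for t in tokens[i:]:
                    node = node.setdefault(t, {})
                break
        return hit

cache = PrefixCache()
shared = ["You", "are", "a", "helpful", "assistant", "."]
r1 = cache.admit(shared + ["Summarize", "this", "paper"])
r2 = cache.admit(shared + ["Translate", "this", "text"])
print(r1, r2)  # → 0 6: the second request reuses the 6-token shared prefix
```

In a real serving system the trie entries would hold GPU-resident KV-cache state, and the scheduler would route requests with matching prefixes to the workers that already hold that state.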

Bio:
Yiying Zhang is an associate professor in the Computer Science and Engineering Department at the University of California, San Diego. Her research interests span ML systems, operating systems, distributed systems, computer architecture, data-center networking, compilers, and systems security. Her current research focuses on building next-generation compound AI systems. She has won several awards, including an OSDI Best Paper Award, a SYSTOR Best Paper Award, an FPGA Best Paper runner-up award, an NSF CAREER Award, and various research awards from industry, including the Semiconductor Research Corporation, Google, Meta, Amazon, Intel, and VMware. Yiying received her Ph.D. from the Department of Computer Sciences at the University of Wisconsin-Madison and was an assistant professor at Purdue University before joining UCSD.