Sky Seminar: Baris Kasikci (UW) – The Quest for Blazingly Fast LLM Serving

Date: March 7, 2025

Time: 12-1pm PST

Location: Soda 510

Title: The Quest for Blazingly Fast LLM Serving

Abstract:
Large Language Models (LLMs) have created surging demand for planet-scale serving systems, in which tens of thousands of GPUs continuously serve hundreds of millions of users. Recent developments have pushed LLM serving into a compute-bound regime for most common workloads. However, most existing serving engines fall short of optimal compute utilization because the heterogeneous operations that make up LLM serving (compute, memory, and networking) are executed sequentially within a device.

In this talk, I’ll introduce Nanoflow, a novel serving framework that exploits intra-device parallelism by overlapping the use of heterogeneous resources within a single device. Nanoflow splits inputs into smaller nano-batches and duplicates operations so that each copy works on its own nano-batch independently, which enables overlapping. Nanoflow automatically determines the number, size, ordering, and GPU resource allocation of nano-batches to minimize execution time while accounting for interference between concurrent operations. We evaluate Nanoflow’s end-to-end serving throughput on several popular models, including LLaMA-2-70B, Mixtral 8×7B, and LLaMA-3-8B. On practical workloads, Nanoflow provides a 1.91× throughput boost over state-of-the-art serving systems, achieving between 50% and 72% of optimal throughput across popular models.
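
As a rough illustration of why nano-batching can help, the toy timeline model below compares running every operation back to back against letting operations that use different resources overlap across nano-batches. It is only a sketch under simplifying assumptions, not Nanoflow's implementation: the function names, the three-stage cost table, and the per-request costs are all hypothetical, each resource runs one operation at a time, and the interference between concurrent operations that the abstract mentions is ignored.

```python
def split_into_nano_batches(requests, num_nano_batches):
    """Split a list of requests into roughly equal contiguous nano-batches."""
    size, extra = divmod(len(requests), num_nano_batches)
    batches, start = [], 0
    for i in range(num_nano_batches):
        end = start + size + (1 if i < extra else 0)
        batches.append(requests[start:end])
        start = end
    return batches


def sequential_time(batches, stage_costs):
    """Every stage of every nano-batch runs one after another."""
    return sum(len(b) * cost for b in batches for _, cost in stage_costs)


def overlapped_time(batches, stage_costs):
    """Pipeline makespan when stages on different resources may overlap.

    A stage starts once the previous stage of the same nano-batch has
    finished AND the resource it needs is free.
    """
    resource_free = {name: 0.0 for name, _ in stage_costs}
    finish = 0.0
    for b in batches:
        ready = 0.0  # time at which this nano-batch's previous stage ended
        for name, cost in stage_costs:
            start = max(ready, resource_free[name])
            ready = start + len(b) * cost
            resource_free[name] = ready
        finish = max(finish, ready)
    return finish


# Hypothetical per-request cost of each stage, in arbitrary time units.
STAGE_COSTS = [("compute", 2.0), ("memory", 1.0), ("network", 1.0)]

requests = list(range(8))
one_batch = split_into_nano_batches(requests, 1)
nano_batches = split_into_nano_batches(requests, 4)

print(sequential_time(nano_batches, STAGE_COSTS))  # 32.0
print(overlapped_time(one_batch, STAGE_COSTS))     # 32.0 (nothing to overlap)
print(overlapped_time(nano_batches, STAGE_COSTS))  # 20.0
```

In this toy setting, a single large batch gains nothing because its stages depend on each other, while four nano-batches let the memory and network stages of one nano-batch run alongside the compute stage of the next. A real system would also have to account for slowdowns when operations share a device, which this model leaves out.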

Speaker Bio:
Baris Kasikci is an associate professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington. His research focuses on building large-scale computer systems that are efficient, reliable, and secure. Previously, he was a Morris Wellman assistant professor in the EECS Department at the University of Michigan and, before that, a researcher at Microsoft Research. He has a PhD in Computer Science from EPFL and has held roles at Google, Intel, and VMware. He is the recipient of an NSF CAREER award, Intel Rising Star Award, Google Faculty Awards, Intel Faculty Awards, IEEE MICRO Top Picks Awards, Jay Lepreau Best Paper Award at OSDI, SIGCOMM Best Paper Award, MICRO Best Paper Award, VMware fellowship, Roger Needham PhD Award for the best PhD thesis in computer systems in Europe, and the Patrick Denantes Memorial Prize for best PhD thesis at EPFL. More details can be found on his webpage: https://homes.cs.washington.edu/~baris/.