Speaker: Stephanie Wang
Location: Soda Hall, 510
Date: March 21, 2025
Time: 12 – 1 pm PT
Title: Towards a Distributed OS for ML Applications
Abstract:
With the rise of large language models, distributed execution across multiple accelerators has become commonplace. Current ML systems must adopt complex distributed execution strategies for efficiency, but they do so at the cost of extensibility. Thus, it is time to introduce a general-purpose distributed OS for flexible, high-performance programming of clusters of accelerators. In this talk, I will present prior and future work on the Ray system toward this goal. I'll briefly describe Ray Data, a system that leverages Ray to pipeline execution across distributed CPUs and GPUs in heterogeneous ML pipelines. Next, I'll describe ongoing and future work extending Ray with native support for distributed GPU tensors, with distributed use cases including MPMD training and inference, RLHF, and prefill disaggregation.
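To make the Ray Data idea concrete, the sketch below shows (under my own assumptions, not material from the talk) how a heterogeneous pipeline might interleave CPU preprocessing with GPU inference: stateless CPU tasks feed batches to stateful GPU actors, and Ray Data streams data between the two stages. The names `preprocess` and `Predictor` and all placeholder logic are illustrative only.

```python
# Minimal, hypothetical Ray Data pipeline sketch: CPU stage + GPU stage.
# Requires a Ray cluster (or local machine); the GPU stage assumes at
# least one GPU is available -- drop `num_gpus` to run CPU-only.
import ray

ray.init()

# CPU stage: a stateless function, executed as Ray tasks across CPUs.
def preprocess(batch):
    # `batch` is a dict of NumPy arrays; `ray.data.range` yields column "id".
    batch["x"] = batch["id"] * 2  # placeholder transformation
    return batch

# GPU stage: a stateful class so each actor loads its model exactly once.
class Predictor:
    def __init__(self):
        # In a real pipeline, load model weights onto the GPU here.
        pass

    def __call__(self, batch):
        batch["y"] = batch["x"] + 1  # placeholder "inference"
        return batch

ds = (
    ray.data.range(10_000)            # synthetic dataset for illustration
    .map_batches(preprocess)          # scales out over cluster CPUs
    .map_batches(
        Predictor,
        batch_size=256,
        num_gpus=1,                   # pin each actor to one GPU
        concurrency=2,                # two GPU actors; tune to your cluster
    )
)

print(ds.take(5))
```

Because the stages run concurrently, CPU preprocessing for later batches overlaps with GPU inference on earlier ones, which is the pipelining behavior the abstract refers to.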
Speaker Bio:
Stephanie is an assistant professor at the University of Washington, a creator of the open-source project Ray, and a founding engineer at Anyscale. Previously, she completed her PhD at UC Berkeley. Her research is in distributed systems, cloud computing, and systems for machine learning and data. Previous projects include Exoshuffle, which broke the CloudSort record for cost-efficient distributed sorting, and Ray Core, the distributed compute engine that was used to train GPT-4.