February 10, 2023

Sky Seminar: Hakim Weatherspoon (Cornell) – Have Your Cake and Eat It: Reliably Running Stateful Virtual Machines in Cheap Spot Markets

Speaker: Hakim Weatherspoon
Location: Soda 430-438, Woz Lounge
Date: February 10, 2023
Time: 12-1pm PST
Title: Have Your Cake and Eat It: Reliably Running Stateful Virtual Machines in Cheap Spot Markets
Abstract: Cloud enterprise consumers spend millions of dollars each month renting space on computers owned by cloud providers. Cloud spot markets, as provided by Amazon, Microsoft, and Google, allow the use of unrented computers at up to 10 times less than the normal rate, reducing costs by up to 90%. There is just one catch: cloud providers reserve the right to take those computers back at any time with little to no warning, making the spot market nearly impossible to use reliably for stateful applications. In this talk, we explore the use of seamless live migration of stateful application containers and virtual machines (VMs) to take advantage of spot markets, allowing stateful applications to benefit from their significant discounts. We show that in unstable markets, live migration of stateful applications can achieve significant savings at low overhead while maintaining good reliability.
Bio: Hakim Weatherspoon is a Professor in the Department of Computer Science at Cornell University, Associate Director of the Cornell Institute for Digital Agriculture (CIDA), and the Chief Executive Officer of Exotanium, Inc. (http://exotanium.io). His research interests cover various aspects of fault tolerance, reliability, security, and performance of internet-scale data systems such as cloud and distributed systems. Weatherspoon received his PhD from the University of California, Berkeley. He has received awards for his many contributions, including the University of Washington Allen School of Computer Science and Engineering Alumni Achievement Award, an Alfred P. Sloan Research Fellowship, a National Science Foundation CAREER Award, and a Kavli Fellowship from the National Academy of Sciences. He serves as Vice President of the USENIX Board of Directors and is a founder, steering committee member, and general chair of the ACM Symposium on Cloud Computing. Hakim has also been recognized for his work to promote diversity, earning Cornell's Zellman Warhaft Commitment to Diversity Award. Since 2011, he has organized the annual SoNIC Summer Research Workshop to help prepare students from underrepresented groups to pursue their Ph.D. in computer science.
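The core mechanism the abstract describes is reacting to a provider's reclamation notice quickly enough to move a running workload elsewhere. The following is a minimal sketch, assuming the AWS EC2 spot-interruption metadata endpoint; migrate_to_new_spot_instance is a hypothetical stand-in for the seamless live-migration step discussed in the talk, not an actual API.

```python
# Minimal sketch: poll the EC2 instance metadata service for a spot
# interruption notice and trigger a (hypothetical) live migration before
# the instance is reclaimed. migrate_to_new_spot_instance() stands in for
# the seamless VM/container live migration described in the talk.
import time
import urllib.error
import urllib.request

INTERRUPTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    """Return True if AWS has scheduled this spot instance for reclamation."""
    try:
        with urllib.request.urlopen(INTERRUPTION_URL, timeout=1) as resp:
            return resp.status == 200  # body contains the action and time
    except urllib.error.HTTPError:
        return False  # 404 means no interruption is scheduled
    except urllib.error.URLError:
        return False  # metadata service unreachable (e.g., not on EC2)

def migrate_to_new_spot_instance() -> None:
    """Hypothetical hook: live-migrate the running VM/container elsewhere."""
    print("Interruption notice received; starting live migration...")

if __name__ == "__main__":
    while True:
        if interruption_pending():
            migrate_to_new_spot_instance()
            break
        time.sleep(5)  # AWS gives roughly a two-minute warning
```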


February 3, 2023

Sky Seminar: Carole-Jean Wu (Meta) – Scaling AI Computing Sustainably

Speaker: Carole-Jean Wu
Location: Soda 430-438, Woz Lounge
Date: February 3, 2023
Time: 12-1pm PST
Title: Scaling AI Computing Sustainably
Abstract: The past 50 years have seen a dramatic increase in the amount of compute per person, in particular compute enabled by AI. Modern natural language processing models are fueled by over a trillion parameters, while the memory needs of neural recommendation and ranking models have grown from hundreds of gigabytes to the terabyte scale. I will highlight recent advancements in important deep learning models and present hardware-software optimization opportunities across the machine learning system stack. AI technologies come with significant environmental implications. I will talk about the carbon footprint of AI computing by examining the model development cycle, spanning data, algorithms, and system hardware, and, at the same time, considering the life cycle of system hardware from the perspective of hardware architectures and manufacturing technologies. The talk will capture the operational and manufacturing carbon footprint of AI computing. Based on industry experience and lessons learned, I will share key challenges across the many dimensions of AI and discuss how at-scale optimization can help reduce the overall carbon footprint of AI and computing. This talk will conclude with important development and research directions to advance the field of computing in an environmentally responsible and sustainable manner.
Bio: Carole-Jean Wu is currently a Research Scientist at Meta. She is a founding member and a Vice President of MLCommons, a non-profit organization that aims to accelerate machine learning innovations for the benefit of all. Dr. Wu also serves on the MLCommons Board of Directors, chaired the MLPerf Recommendation Benchmark Advisory Board, and co-chaired MLPerf Inference. Prior to Facebook/Meta, Carole-Jean was an Associate Professor at ASU. Dr. Wu's expertise sits at the intersection of computer architecture and machine learning. Her work spans datacenter infrastructures and edge systems, such as developing energy- and memory-efficient systems and microarchitectures, optimizing systems for machine learning execution at scale, and designing learning-based approaches for system design and optimization. She is passionate about pathfinding and tackling system challenges to enable efficient and responsible AI technologies. Her work has been recognized with several awards, including IEEE Micro Top Picks and ACM/IEEE Best Paper Awards. In addition, her work has been featured at the MLPerf Inference v0.5 Launch and Results, MaskRCNN2Go for MLPerf, Tech @ Meta, and Bloomberg Green. Dr. Wu is the recipient of the NSF CAREER Award, the IEEE Young Engineer of the Year Award, the Science Foundation Arizona Bisgrove Early Career Scholarship, the Facebook AI Infrastructure Mentorship Award, and the HPCA and IISWC Hall of Fame. She was the Program Co-Chair of the Conference on Machine Learning and Systems (MLSys) and Program Chair of the IEEE International Symposium on Workload Characterization (IISWC). She received her M.A. and Ph.D. degrees in Electrical Engineering from Princeton University and her B.Sc. degree in Electrical and Computer Engineering from Cornell University.
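As a rough orientation to the operational versus manufacturing (embodied) split the abstract mentions, here is a minimal sketch of the standard decomposition. All numbers and function names are illustrative assumptions, not figures from the talk.

```python
# Minimal sketch of decomposing an AI job's carbon footprint into
# operational and embodied (manufacturing) components.
# All numbers below are illustrative assumptions, not figures from the talk.

def operational_carbon_kg(energy_kwh: float, pue: float,
                          carbon_intensity_kg_per_kwh: float) -> float:
    """Operational CO2e: energy drawn by the accelerators, scaled by
    datacenter overhead (PUE) and the grid's carbon intensity."""
    return energy_kwh * pue * carbon_intensity_kg_per_kwh

def embodied_carbon_kg(hardware_embodied_kg: float, job_hours: float,
                       hardware_lifetime_hours: float) -> float:
    """Embodied CO2e: the hardware's manufacturing footprint amortized over
    the fraction of its service life consumed by this job."""
    return hardware_embodied_kg * (job_hours / hardware_lifetime_hours)

if __name__ == "__main__":
    # Hypothetical training job: 10 MWh of accelerator energy, PUE of 1.1,
    # grid intensity of 0.4 kg CO2e/kWh, run for 500 hours on hardware with
    # 1,500 kg of embodied carbon and a ~35,000-hour service life.
    op = operational_carbon_kg(energy_kwh=10_000, pue=1.1,
                               carbon_intensity_kg_per_kwh=0.4)
    emb = embodied_carbon_kg(hardware_embodied_kg=1_500, job_hours=500,
                             hardware_lifetime_hours=35_000)
    print(f"operational: {op:.0f} kg CO2e, embodied: {emb:.0f} kg CO2e")
```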


January 20, 2023

Sky Seminar: Manya Ghobadi (MIT) – Next-Generation Optical Networks for Emerging ML Workloads

Speaker: Manya Ghobadi
Location: Soda 430-438, Woz Lounge
Date: Friday, January 20, 2023
Time: 12-1pm PST
Title: Next-Generation Optical Networks for Emerging ML Workloads
Abstract: In this talk, I will explore three elements of designing next-generation machine learning systems: congestion control, network topology, and computation frequency. I will show that fair sharing, the holy grail of congestion control algorithms, is not necessarily desirable for deep neural network training clusters. Then I will introduce a new optical fabric that optimally combines network topology and parallelization strategies for machine learning training clusters. Finally, I will demonstrate the benefits of leveraging photonic computing systems for real-time, energy-efficient inference via analog computing. Pushing the frontiers of optical networks for machine learning workloads will enable us to fully harness the potential of deep neural networks and achieve improved performance and scalability.
Bio: Manya Ghobadi is a faculty member in the EECS department at MIT. Her research spans different areas in computer networks, focusing on optical reconfigurable networks, networks for machine learning, and high-performance cloud infrastructure. Her work has been recognized by a Sloan Fellowship in Computer Science, the ACM SIGCOMM Rising Star award, the NSF CAREER award, the Optica Simmons Memorial Speakership award, a best paper award at the Machine Learning and Systems (MLSys) conference, and the best dataset and best paper awards at the ACM Internet Measurement Conference (IMC). Manya received her Ph.D. from the University of Toronto and spent a few years at Microsoft Research and Google prior to joining MIT.


November 4, 2022

Sky Seminar: Zhifeng Chen (Google Research) – Some Scalability Challenges in Machine Learning

Speaker: Zhifeng Chen
Location: Soda 430-438, Woz Lounge
Date: Friday, November 4, 2022
Time: 12-1pm PST
Title: Some Scalability Challenges in Machine Learning
Abstract: Over the past decade, AI has presented us with numerous amazing results and has had a huge impact on our lives. These achievements are the result of advancements in algorithms, data, and hardware. Scalability is a common theme in the development of these areas. In this talk, I will share some scalability challenges faced by my colleagues and me at Google and what we have built to address them. I will also discuss some challenges we are currently facing and research directions that may help solve them.
Bio: Dr. Zhifeng Chen is a distinguished engineer in Google Research, Brain. His recent work focuses on scalable machine learning systems and algorithms. He collaborates with many machine learning researchers and is interested in areas such as machine translation, speech recognition and synthesis, 3D perception, and large language models. He helped build several of Google's infrastructure software systems, including TensorFlow, Zanzibar, and BigTable.


October 28, 2022

Sky Seminar: Fredrik Kjolstad – Stanford

Title: Portable Compilation of Sparse Computation
Abstract: Hardware is becoming more diverse, and architects are designing a host of new accelerators. Different types of accelerators are being deployed in different data centers, making it harder to port applications across machines and across clouds. I will discuss the design of compilers for data-intensive applications in heterogeneous systems. I will then describe how to compile sparse tensor algebra and array operations to the major classes of heterogeneous hardware: CPUs, fixed-function accelerators, GPUs, distributed machines, and streaming dataflow accelerators. Finally, I will discuss the promise of portable compilation of more general classes of computations.
Bio: Fredrik Kjolstad is an Assistant Professor in Computer Science at Stanford University. He works on topics in compilers, programming models, architecture, and systems, with an emphasis on fast compilation and compilers for sparse computing problems, where the algorithm should be separated from the data representation. He has received the MIT EECS Sprowls PhD Thesis Award, the NSF CAREER Award, a Google Research Scholarship, and three distinguished paper awards.
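To make concrete what separating the algorithm from the data representation buys, here is a minimal sketch, not taken from the talk, of a sparse matrix-vector multiply hand-written for one particular format (CSR). A sparse compiler of the kind discussed above would generate this sort of loop nest automatically from a format-independent expression of the algebra.

```python
# Minimal sketch (not from the talk): sparse matrix-vector multiply over a
# CSR (compressed sparse row) representation.
from typing import List

def csr_spmv(values: List[float], col_idx: List[int], row_ptr: List[int],
             x: List[float]) -> List[float]:
    """Compute y = A @ x where A is stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):  # nonzeros of row i
            y[i] += values[k] * x[col_idx[k]]
    return y

if __name__ == "__main__":
    # A = [[10, 0, 0],
    #      [ 0, 0, 20],
    #      [30, 0, 40]]
    values = [10.0, 20.0, 30.0, 40.0]
    col_idx = [0, 2, 0, 2]
    row_ptr = [0, 1, 2, 4]
    print(csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))  # [10.0, 60.0, 150.0]
```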


October 21, 2022

Sky Seminar: Emmett Witchel – UT Austin

Abstract: Serverless computing has become increasingly popular for building scalable cloud applications, such as video analytics and machine learning. One key challenge is the mismatch between the stateless nature of serverless functions and the stateful applications built with them. Managing shared state using current options, e.g., cloud databases or object stores, struggles to achieve strong consistency and fault tolerance while maintaining high performance and scalability. Distributed shared logs provide storage that can simultaneously achieve scalability, strong consistency, and fault tolerance. A distributed shared log offers a simple abstraction: a totally ordered log that can be accessed and appended to concurrently. Boki is a functions-as-a-service (FaaS) runtime that provides a shared log API for serverless functions to store shared state. Boki separates the read and write paths: writes are optimized with scale-out bandwidth, and read locality is optimized with a cache on function nodes. It also provides flexible metadata tags to optimize selective reads. Boki provides high performance, read consistency, and fault tolerance with a single log-based mechanism, the metalog. We implement streaming computations on Boki featuring exactly-once execution for each input record, even under server failures. Serverless functions provide elastic compute that can be scaled up or down. Keeping related records in a single, totally ordered Boki logbook provides significant speedups, in part by replacing a two-phase commit protocol with a one-phase protocol.
Bio: Emmett Witchel is a Professor of Computer Science at the University of Texas at Austin, where he has been on the faculty since 2004, after receiving his PhD at MIT. His thesis won honorable mention for the ACM doctoral dissertation award. Witchel's research interests include operating systems, security, architecture, and concurrency. His recent work has been on serverless computing, fault-tolerant logs, persistent memory, side-channel security for GPUs, and trusted execution environments. He co-chaired Architectural Support for Programming Languages and Operating Systems (ASPLOS) in 2019. His publishing recognition includes research highlights in Communications of the ACM (CACM), IEEE Micro Top Picks, and best paper awards at both the Symposium on Operating Systems Principles (SOSP) and Operating Systems Design and Implementation (OSDI). He was on the founding team of Katana Graph (2019), a graph database company, and works there as a principal engineer.
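For readers unfamiliar with the shared-log abstraction the abstract leans on, the following is a minimal single-process sketch of a totally ordered log with concurrent appends and tag-based selective reads. The class and method names are illustrative and are not Boki's actual API.

```python
# Minimal sketch of the shared-log abstraction described above: a totally
# ordered log supporting concurrent appends, reads by position, and
# tag-based selective reads. The names are illustrative, not Boki's API.
import threading
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class LogRecord:
    seqnum: int                      # position in the total order
    payload: bytes
    tags: Set[str] = field(default_factory=set)

class SharedLog:
    def __init__(self) -> None:
        self._records: List[LogRecord] = []
        self._lock = threading.Lock()  # stands in for the log's ordering layer

    def append(self, payload: bytes, tags: Optional[Set[str]] = None) -> int:
        """Append a record and return its sequence number in the total order."""
        with self._lock:
            seqnum = len(self._records)
            self._records.append(LogRecord(seqnum, payload, tags or set()))
            return seqnum

    def read(self, seqnum: int) -> LogRecord:
        return self._records[seqnum]

    def read_next(self, tag: str, after: int = -1) -> Optional[LogRecord]:
        """Selective read: first record carrying `tag` after position `after`."""
        for rec in self._records[after + 1:]:
            if tag in rec.tags:
                return rec
        return None

if __name__ == "__main__":
    log = SharedLog()
    log.append(b"order:42 created", tags={"order:42"})
    log.append(b"order:43 created", tags={"order:43"})
    log.append(b"order:42 paid", tags={"order:42"})
    rec = log.read_next("order:42", after=0)
    print(rec.seqnum, rec.payload)  # 2 b'order:42 paid'
```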


October 19, 2022

Sky Camp 2022

Sky Computing will hold its first annual camp event, with talks and tutorials, on October 19, 2022, in person at Banatao Auditorium on the UC Berkeley campus. Visit the website for more details or email skycampATberkeley.edu.


October 14, 2022

Sky Seminar: Mark Russinovich – Microsoft Azure

Title: Microsoft's AI Infrastructure Innovations
Abstract: Training advanced deep learning models can be challenging. Large models can provide better accuracy but can be difficult to train because of the time cost, the complexity of code integration, and the need for massive-scale infrastructure. In this session, Mark Russinovich will share how Microsoft designs and improves its AI infrastructure and system software to serve both its internal teams and customers. Furthermore, Mark will share a future trend we see: the emergence of ML confidential computing to address data privacy and compliance requirements from many industries.
Bio: Mark Russinovich is Chief Technology Officer and Technical Fellow for Microsoft Azure, Microsoft's global enterprise-grade cloud platform. A widely recognized expert in distributed systems, operating systems, and cybersecurity, Mark earned a Ph.D. in computer engineering from Carnegie Mellon University. He later co-founded Winternals Software, joining Microsoft in 2006 when the company was acquired. Mark is a popular speaker at industry conferences such as Microsoft Ignite, Microsoft Build, and RSA Conference. He has authored several nonfiction and fiction books, including the Microsoft Press Windows Internals book series, Troubleshooting with the Sysinternals Tools, and the fictional cybersecurity thrillers Zero Day, Trojan Horse, and Rogue Code.


October 7, 2022

Sky Seminar: Russell Sears – Apple



September 30, 2022

Sky Seminar: Ioannis Papapanagiotou – Gemini / Netflix

Title: Elastic Cloud Services: Scaling Snowflake's Control Plane
Abstract: Snowflake's "Data Cloud" enables data storage, processing, and analytic solutions in a performant, easy-to-use, and flexible manner. Although cloud service providers supply the foundational infrastructure to run and scale a variety of workloads, operating Snowflake on cloud infrastructure presents interesting challenges. Customers expect Snowflake to be available at all times and to run their workloads with high performance. Behind the scenes, the software that runs customer workloads needs to be serviced and managed. Additionally, failures in individual components such as virtual machines (VMs) need to be handled without disrupting running workloads. As a result, lifecycle management of compute artifacts, their scheduling and placement, software rollout (and rollback) processes, replication, failure detection, automatic scaling, and load balancing become extremely important. In this talk, we will cover the design and operation of Snowflake's Elastic Cloud Services (ECS) layer, which manages cloud resources at global scale to meet the needs of the Snowflake Data Cloud. It provides the control plane to enable elasticity, availability, fault tolerance, and efficient execution of customer workloads. ECS runs on multiple cloud service providers and provides capabilities such as cluster management, safe code rollout and rollback, management of pre-started pools of running VMs, horizontal and vertical autoscaling, throttling of incoming requests, VM placement, load balancing across availability zones, and cross-cloud and cross-region replication.
Bio: Ioannis Papapanagiotou is a director of engineering building a modern blockchain platform at Gemini. Ioannis is also a research assistant professor at the University of New Mexico. He holds a dual Ph.D. degree in Computer Engineering and Operations Research. His main focus is on data platforms, cloud computing, and corporate culture. In the past, Ioannis served as the senior manager of the Services organization at Snowflake, supporting the core services of Snowflake's Data Cloud. Prior to that, Ioannis was a senior manager at Netflix's Data Platform, building the storage and data integrations team from the ground up and also serving as the leader of the key/value stores and database streaming infrastructure. Ioannis has served in the faculty ranks of Purdue University (tenure-track) and NC State University, and was an engineer at IBM and a mentor to several startups. He has been awarded the NetApp faculty fellowship and established an Nvidia CUDA Research Center at Purdue University. Ioannis has also received the IBM Ph.D. Fellowship, the Academy of Athens Ph.D. Fellowship for his Ph.D. research, and best paper awards at several IEEE conferences. Ioannis has authored a number of research articles and patents. He is a senior member of the ACM and IEEE.
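As a small illustration of one capability the abstract lists, pre-started pools of running VMs, here is a minimal toy sketch. It is not Snowflake's implementation; all class and method names are made up for illustration.

```python
# Illustrative sketch (not Snowflake's implementation): a pre-started pool
# of running VMs. Keeping warm VMs on hand lets a control plane hand out
# capacity immediately and replenish the pool in the background.
from collections import deque
from dataclasses import dataclass
from typing import Deque

@dataclass
class VM:
    vm_id: str
    warm: bool = True

class PrestartedPool:
    def __init__(self, target_size: int) -> None:
        self.target_size = target_size
        self._pool: Deque[VM] = deque()
        self._next_id = 0

    def _boot_vm(self) -> VM:
        """Stand-in for a cloud-provider call that starts a fresh VM."""
        self._next_id += 1
        return VM(vm_id=f"vm-{self._next_id}")

    def replenish(self) -> None:
        """Top the pool back up to its target size."""
        while len(self._pool) < self.target_size:
            self._pool.append(self._boot_vm())

    def acquire(self) -> VM:
        """Hand out a warm VM if available; otherwise boot one on demand."""
        vm = self._pool.popleft() if self._pool else self._boot_vm()
        self.replenish()
        return vm

if __name__ == "__main__":
    pool = PrestartedPool(target_size=3)
    pool.replenish()
    print(pool.acquire().vm_id)  # served from the warm pool
```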