The Tip of the Iceberg: How to make ML for Systems work
Machine Learning has become a powerful tool to improve computer systems, and there is a significant amount of ongoing research in both academia and industry to tackle systems problems using ML. Most work focuses on learning patterns and replacing heuristics with these learned patterns to solve systems problems such as compiler optimization, query optimization, failure detection, indexing, and caching. However, solutions that truly improve systems need to maintain the efficiency, availability, reliability, and maintainability of systems while integrating Machine Learning into the system. In this talk, I will cover the key challenges and surprising joys of designing, implementing, and deploying ML for Systems solutions, based on my experiences of building and deploying these systems at Google.
Deniz Altınbüken is a Senior Software Engineer at Google Research and part of the Google Brain team. She received her PhD in distributed systems from Cornell University in 2017 under the supervision of Robbert van Renesse, specializing in consensus protocols and self-adapting systems. Since her PhD she has worked on large-scale database systems and learned systems. Currently, she is focusing on improving the state of the art in systems using Machine Learning, mainly in caching and indexing. Her work has appeared in top-tier conferences and workshops such as SOSP and ML for Systems at NeurIPS.
The Story of Raft
In this talk I will discuss the back-story behind the Raft consensus algorithm: why we decided to undertake this project, how the algorithm developed, and the challenges of publishing an idea that “gores a sacred cow.” I will also make several observations about how to perform research, how program committees work, and the relationship between Paxos and Raft.
From Car Mechanics to Drivers: Automated Machine Learning for the Rest of us
Automatic Machine Learning offers the opportunity to build well-founded statistical models for tabular, image, and text data. A common strategy is to pick a model and to use AutoML to optimize the hyperparameters of said model. In this talk I argue that ensembling and stacking offer a much more robust and effective solution to the problem of automatic modeling. This is particularly relevant when trying to find the best model subject to a runtime constraint. I’ll discuss what is next for AutoGluon in terms of community, features, and scientific challenges.
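To make the idea of ensembling under a runtime constraint concrete, here is a minimal sketch of greedy ensemble selection with a latency budget. It is an illustration only and is not AutoGluon's implementation: the function names, the candidate models, their validation predictions, and their costs in milliseconds are all hypothetical, and the ensemble is a plain unweighted average rather than a stacked model.

```python
from statistics import mean

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return mean((p - t) ** 2 for p, t in zip(pred, target))

def average(preds):
    """Element-wise unweighted average of several prediction lists."""
    return [mean(col) for col in zip(*preds)]

def greedy_ensemble(candidates, target, budget_ms):
    """Greedily add the model that most reduces validation error
    while the summed inference cost stays within budget_ms.

    candidates: dict name -> (validation predictions, cost in ms),
    all values hypothetical.
    """
    chosen, cost, best_err = [], 0.0, float("inf")
    while True:
        best_name = None
        for name, (pred, ms) in candidates.items():
            if name in chosen or cost + ms > budget_ms:
                continue  # already used, or would exceed the budget
            trial = [candidates[n][0] for n in chosen] + [pred]
            err = mse(average(trial), target)
            if err < best_err:
                best_err, best_name = err, name
        if best_name is None:  # nothing affordable improves the error
            return chosen, best_err
        chosen.append(best_name)
        cost += candidates[best_name][1]

# Hypothetical validation data: two cheap, biased models and one
# accurate but expensive model.
target = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "a": ([1.4, 2.4, 3.4, 4.4], 2.0),
    "b": ([0.4, 1.4, 2.4, 3.4], 2.0),
    "c": ([1.0, 2.0, 3.0, 4.2], 10.0),
}
cheap_choice, cheap_err = greedy_ensemble(candidates, target, budget_ms=5.0)
rich_choice, rich_err = greedy_ensemble(candidates, target, budget_ms=20.0)
```

With a tight budget the two cheap models are combined because their errors partly cancel, while a generous budget lets the single expensive model be selected on its own.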
Amazon Redshift Re-invented
In 2013, Amazon Web Services revolutionized the data warehousing industry by launching Amazon Redshift, the first fully managed, petabyte-scale cloud data warehouse solution. Amazon Redshift made it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. This launch was a significant leap from the traditional on-premises data warehousing solutions, which were expensive, rigid (not elastic), and needed a lot of tribal knowledge to perform. Unsurprisingly, customers embraced Amazon Redshift and it went on to become the fastest growing service in AWS. Today, tens of thousands of customers use Amazon Redshift in AWS’s global infrastructure of 25 launched Regions and 81 Availability Zones (AZs) to process exabytes of data daily.
The success of Amazon Redshift inspired a lot of innovation in the analytics industry, which in turn has benefited consumers. In the last few years, the use cases for Amazon Redshift have evolved, and in response, Amazon Redshift has delivered a series of innovations that continue to delight customers. In this talk, we take a peek under the hood of Amazon Redshift and give an overview of its architecture. We focus on the core of the system and explain how Amazon Redshift maintains its differentiating industry-leading performance and scalability. We discuss how Amazon Redshift extends beyond traditional data warehousing workloads by integrating with the broad AWS ecosystem, making Amazon Redshift a one-stop solution for analytics. We then talk about Amazon Redshift’s autonomics and Amazon Redshift Serverless. In particular, we present how Redshift continuously monitors the system and uses machine learning to improve its performance and operational health without the need for dedicated administration resources, in an easy-to-use offering.
Ippokratis Pandis is a senior principal engineer at Amazon Web Services, currently working on Amazon Redshift. Redshift is Amazon’s fully managed, petabyte-scale data warehouse service. Previously, Ippokratis held positions as a software engineer at Cloudera, where he worked on the Impala SQL-on-Hadoop query engine, and as a member of the research staff at the IBM Almaden Research Center, where he worked on IBM DB2 BLU. Ippokratis received his PhD from the Electrical and Computer Engineering department at Carnegie Mellon University. He is the recipient of Best Demonstration awards at ICDE 2006 and SIGMOD 2011, and a Test-of-Time award at EDBT 2019. He has served or is serving as PC chair of DaMoN 2014, DaMoN 2015, CloudDM 2016, HPTS 2019, and ICDE Industrial 2022, as well as General Chair of SIGMOD 2023.
Building scalable and flexible cluster managers using declarative programming
Modern cluster managers routinely grapple with hard combinatorial optimization problems, such as policy-based load balancing, placement, scheduling, and configuration. Implementing ad-hoc heuristics to solve these problems is notoriously hard to do, making it challenging to evolve the system over time and add new features.
In this talk, I will present Declarative Cluster Managers (DCM), a general approach for building cluster managers that makes them performant and easily extensible. With DCM, developers specify the cluster manager’s behavior using a high-level declarative language like SQL and let a compiler take care of generating an efficient implementation. I will show how DCM significantly lowers the barrier to building scalable and extensible cluster manager components, in the context of some real-world systems like Kubernetes. You can check out the DCM project here: https://github.com/vmware/declarative-cluster-management
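The declarative style described above can be sketched in a few lines: the developer states constraints on a placement and a generic solver finds an assignment that satisfies them. The sketch below is an assumption-laden illustration, not DCM's API. DCM takes SQL and compiles it to an efficient implementation, whereas this example uses Python predicates and a brute-force search, and the node capacities, pod demands, and constraint names are all hypothetical.

```python
from itertools import product

# Hypothetical cluster state (not taken from DCM or Kubernetes).
NODES = {"n1": 5, "n2": 2}              # node -> CPU capacity
PODS = {"web": 2, "db": 3, "cache": 1}  # pod -> CPU demand

def capacity_respected(assignment):
    """No node hosts more CPU demand than its capacity."""
    used = {node: 0 for node in NODES}
    for pod, node in assignment.items():
        used[node] += PODS[pod]
    return all(used[node] <= NODES[node] for node in NODES)

def db_and_cache_apart(assignment):
    """Anti-affinity: the database and the cache use different nodes."""
    return assignment["db"] != assignment["cache"]

# The "policy" is just a list of declarative constraints; adding a
# feature means appending a predicate, not rewriting the search.
CONSTRAINTS = [capacity_respected, db_and_cache_apart]

def solve(constraints):
    """Brute-force stand-in for a real solver: return the first
    pod -> node assignment satisfying every constraint, or None."""
    pods = sorted(PODS)
    for nodes in product(sorted(NODES), repeat=len(pods)):
        assignment = dict(zip(pods, nodes))
        if all(check(assignment) for check in constraints):
            return assignment
    return None

placement = solve(CONSTRAINTS)
# An extra constraint that cannot be met (db needs 3 CPUs, n2 has 2).
impossible = solve(CONSTRAINTS + [lambda a: a["db"] == "n2"])
```

The exhaustive search is exponential in the number of pods and is only meant to show the separation between what is specified and how it is solved.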
Building cloud-native fault-tolerant applications with reliable actors and retry orchestration
Cloud developers have to build applications that are resilient to failures and interruptions. In this talk, we advocate for a fault-tolerant programming model for the cloud based on actors, reliable message delivery, tail calls, and retry orchestration. This model guarantees not only that (1) failed actor invocations will be retried, but also that (2) observed completed invocations are never repeated and (3) a strict happens-before relationship is preserved across failures within call chains and call stacks. Together these capabilities make it possible to productively develop fault-tolerant applications leveraging arbitrary cloud services. We review key application patterns and failure scenarios. We formalize a process calculus to precisely capture the mechanics of fault tolerance in this model. We demo a prototype implementation as a polyglot service mesh that scales with the application. Using an application inspired by a typical enterprise scenario, we assess the impact of fault preparedness and recovery on performance.
Dr. Olivier Tardieu is a Principal Research Scientist at IBM T.J. Watson, NY, USA. He received a PhD in Computer Science from Ecole des Mines de Paris in 2004, and was a Postdoctoral Research Scientist at Columbia University, before joining IBM Research in 2007. His research focuses on making developers more productive with better programming models and methodologies in application domains ranging from hardware circuit design and stream processing to high-performance computing and cloud. He has authored more than fifty peer-reviewed publications. Dr. Tardieu is one of the architects of the X10 programming language and a founding member of the Apache OpenWhisk project for serverless computing.
The Design and Evolution of the Tock Operating System
Tock is a secure operating system for embedded microcontrollers. Starting as a collaboration between Stanford (OS), Berkeley (hardware) and Michigan (applications and platforms), Tock has become an international collaboration with adoption by several large technology companies. One important early design decision was to implement Tock in Rust, a memory-safe systems programming language which does not rely on garbage collection. I will discuss how Tock has evolved over the past 6 years, especially its security model. I will also discuss some challenges in applying Rust to operating system kernels, particularly how its ownership model and borrow checker interact poorly with event-driven execution.
Philip Alexander Levis is an Associate Professor of Computer Science and Electrical Engineering at Stanford University, where he heads the Stanford Information Networks Group (SING). His research centers on computing systems that interact with or represent the physical world, including low-power computing, wireless networks, sensor networks, embedded systems, and graphics systems. He has been awarded the Okawa Fellowship, an NSF CAREER award, and a Microsoft New Faculty Fellowship. He’s authored over 60 peer-reviewed publications, including four best paper awards, two test of time awards, and one most influential paper award. He has an Sc.B. in Biology and Computer Science with Honors from Brown University, an M.S. in Computer Science from The University of Colorado at Boulder, and a Ph.D. in Computer Science from The University of California, Berkeley. He has a self-destructive aversion to low-hanging fruit and a deep appreciation for excellent engineering.
Networked Systems in the Age of Really Fast Networks
Computer networks have come a long way, from the 10 Mbps Ethernet standard of the 1980s to the 100s of Gbps (and soon to be Tbps) of today’s networks. For application developers, the benefits of growth are clear: higher bandwidth leads to faster network transfers and, thus, to better application performance. For network architects and operators, however, higher bandwidth is a double-edged sword and can lead to both additional challenges and additional opportunities.
In this talk, I will discuss a few of the ways in which my lab has been exploring these issues. To that end, I will discuss our efforts to characterize the challenges of measuring modern networks and techniques to overcome those challenges. I will also discuss our efforts to make model serving lightweight and fast enough for use in today’s networked systems.
Vincent Liu is an Assistant Professor in the Department of Computer and Information Science at the University of Pennsylvania, where he leads the PennNetworks group and is a member of the Distributed Systems Lab (DSL). His research interests are in the areas of distributed systems and networking and have been recognized by an NSF CAREER Award, a VMWare Early Career Faculty Award, a Facebook Faculty Research Award, and several best paper awards at SIGCOMM and NSDI. He received his Ph.D. from the University of Washington while advised by Tom Anderson and Arvind Krishnamurthy.
Connecting Blockchains and the World
What would it take for blockchains to give people and businesses everywhere better, trusted, and innovative financial foundations?
In the first part of the presentation, I will briefly recall the Diem story, June 2018-January 2022. I joined the Diem project (it was named Libra back then) in 2019 as CTO and stayed until its closing. I will explain the stablecoin structure Diem built over a purpose-built blockchain.
In the second part, I will switch gears and talk about my current role as chief research officer at Chainlink Labs. Chainlink enables smart contracts to interact with the real world and to reduce trust in centralized intermediaries. I will provide a glimpse into Chainlink Labs technology and a research outlook.
Understand and Leverage Heterogeneity in Machine Learning Clusters
The products of ByteDance heavily rely on machine learning (ML). My group builds large-scale clusters and systems to support all the ML workloads, including model training and online inference. In this talk, I will explain the technical challenges caused by the heterogeneity of ML jobs and hardware resources and how we address the challenges. The core part of the talk will start from improving training speed and cluster utilization by better leveraging heterogeneous resources. Then, I will share our research and deployment experience on the co-scheduling of training and inference jobs and clusters, which aims to further improve company-wide ML resource utilization. Finally, I will discuss some personal views on future directions.
Yibo is a Research and Engineering Manager and oversees the research and development of machine learning systems at ByteDance. His research interests broadly cover distributed systems and networks, with a recent focus on ML systems. He enjoys building large-scale systems with emerging hardware like GPUs, programmable ASICs, and NICs. Yibo’s work can be characterized as being at the intersection of academic research and practical deployment. Much of his work has been published in top conferences such as SOSP, OSDI, and SIGCOMM, and has also been deployed at scale at ByteDance and his former employer, Microsoft. Yibo obtained his Ph.D. from the Department of Computer Science at UC Santa Barbara, co-advised by Prof. Ben Y. Zhao and Prof. Heather Zheng.
Rockset: Realtime Indexing of semi-structured data for fast analytics on massive datasets
We’re in the middle of an era change where companies are moving from cloud to serverless, allowing them to automatically scale and manage servers, and from batch to real-time, making instant decisions based on the freshest of data. Rockset is a real-time indexing database that is also serverless, powering fast SQL over semi-structured data such as JSON, Parquet, etc., without requiring any schematization. All data loaded into Rockset are automatically indexed and don’t require any database tuning. In this talk, we’ll cover some key design aspects of Rockset:
- Smart Schema: a type system that allows for ingesting any semi-structured data set and presenting them as SQL tables,
- Converged indexing: a data indexing strategy that builds inverted indexes and columnar indexes on all fields in the data set, and
- The Aggregator Leaf Tailer architecture: scale storage, indexing compute and query compute separately and provide elastic storage management using open source RocksDB-Cloud.
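The converged indexing idea from the list above can be illustrated with a toy in-memory structure that indexes every field of every document in two ways at ingest time. This is only a sketch under simplifying assumptions: it is not Rockset's implementation, the function name and sample documents are hypothetical, and it handles flat documents with hashable values only.

```python
from collections import defaultdict

def build_indexes(docs):
    """Index every field of every document in two ways."""
    inverted = defaultdict(set)   # (field, value) -> ids of matching docs
    columnar = defaultdict(dict)  # field -> {doc id: value}
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            inverted[(field, value)].add(doc_id)
            columnar[field][doc_id] = value
    return inverted, columnar

# Hypothetical semi-structured input: the third document has no
# "amount" field, and no schema was declared up front.
docs = [
    {"user": "ann", "city": "Oslo", "amount": 10},
    {"user": "bob", "city": "Rome", "amount": 25},
    {"user": "cat", "city": "Oslo"},
]
inverted, columnar = build_indexes(docs)

# A selective predicate is answered from the inverted index ...
oslo_ids = sorted(inverted[("city", "Oslo")])
# ... while an aggregation scans a single column.
total_amount = sum(columnar["amount"].values())
```

Keeping both structures lets a query planner pick the inverted index for highly selective filters and the columnar layout for scans and aggregations, at the cost of extra storage and write work.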
We will share our SSB benchmark results to demonstrate that Rockset can scale to millisecond query latencies on terabyte-size datasets. We will give you a live demonstration of building a real-time recommendation system using Twitter’s live tweet stream.
As future projects, we can discuss how to use machine-learning methods on query logs to selectively build only those indices that are needed. Another discussion subject is to analyze how Rockset uses RocksDB’s LSM Storage Engine to store data in columnar format. Yet another discussion subject is to evaluate the efficiency of an LSM database using a cloud storage hierarchy: Rockset stores LSM data in AWS S3, but S3 read latency is high and spiky. So Rockset uses the cloud’s storage hierarchy to cache SST files from S3 on a remote SSD-based storage tier and an SSD-based local block cache.
Dhruba Borthakur is the CTO and co-founder of Rockset (http://rockset.com). Previously, he was an engineer on the database team at Facebook, where he was the founding engineer of the RocksDB data store. Earlier at Yahoo, he was one of the founding engineers of the Hadoop Distributed File System. He was also a contributor to the open-source Apache HBase project. Dhruba previously held various roles at Veritas Software, founded an e-commerce startup, Oreceipt.com, and contributed to Andrew File System (AFS) at IBM-Transarc Labs.