Date: Friday, September 26th, 2025
Time: 12-1pm PT
Location: 510 Soda
Title: Analyzing Metastable Failures
Abstract: Metastable failures are congestive collapses in which the system does not recover after a transient stressor, such as increased load or diminished capacity, subsides. They are rare, but potentially catastrophic if the failure cascades across interdependent microservices, and they are notoriously hard to diagnose and mitigate, sometimes causing prolonged outages affecting millions of users. Standard resiliency mechanisms, including retry with exponential backoff, load shedding, and queue bounds, are important components of defense in depth against metastable failures. However, it is challenging for a service operator to configure these mechanisms appropriately while balancing performance and availability requirements. Even worse, there is no way for operators to have confidence that a given set of defensive mechanisms is sufficient to prevent future metastable failures. In this talk I will describe how we are tackling this problem at AWS with a suite of tools ranging from modeling the system as continuous-time Markov chains to discrete event simulation to emulation in the cloud.
Bio: Rebecca Isaacs is a Senior Principal Scientist at AWS. Her research interests span distributed and parallel systems, operating systems, and networks. She previously worked at Twitter, Google, and Microsoft Research, and received her PhD from the University of Cambridge.