Sky Systems Seminar: Aurojit Panda (NYU) – Detecting and Reacting to Errors in Distributed Systems at Runtime

Speaker: Aurojit Panda

Location: Soda 510

Date: October 6, 2023

Time: 11am-12pm PT

Title:

Detecting and Reacting to Errors in Distributed Systems at Runtime

Abstract: Correctly implementing a distributed system is hard. While several recent proposals, e.g., Dafny, have described methods for building provably correct distributed systems, they have not seen wide adoption due to a lack of developer expertise and concerns about the resulting system’s performance. Consequently, most distributed systems are written manually by developers and often contain bugs. Bugs continue to be found even in systems such as Zookeeper and Etcd that have been around for nearly a decade and are widely deployed, and these bugs can have wide-ranging effects. In this talk, I am going to describe ongoing work that shows how to detect bugs in deployed distributed systems at runtime. Our efforts have been focused on developing an approach that does not require changes to the distributed system, requires no additional messages or coordination, has minimal performance overheads, and allows bugs to be detected soon after they occur. I will then describe challenges in using such a detector to improve a deployed system’s resilience.

Bio: Aurojit Panda is an assistant professor in the Computer Science department at New York University. He works in systems and networking, though he borrows (or steals) ideas and problems from several fields, including formal methods, programming languages, graphics and machine learning. He received his PhD in 2017 from UC Berkeley, where he was advised by Scott Shenker. He has received several awards, including a VMware Early Career Faculty Award, a Google Research Scholar Award, an NSF CAREER award, best paper awards at EuroSys, SIGCOMM and OSDI, and a EuroSys test of time award.