Run LLMs, AI, and Batch Jobs Anywhere
SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution.
SkyPilot abstracts away cloud infra burdens:
- Launch jobs & clusters on any cloud
- Easy scale-out: queue and run many jobs, automatically managed
- Easy access to object stores (S3, GCS, R2)
SkyPilot maximizes GPU availability for your jobs:
- Provision in all zones/regions/clouds you have access to (the Sky), with automatic failover
SkyPilot cuts your cloud costs:
- Managed Spot: 3-6x cost savings using spot VMs, with auto-recovery from preemptions
- Optimizer: 2x cost savings by auto-picking the cheapest VM/zone/region/cloud
- Autostop: hands-free cleanup of idle clusters
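To make the optimizer and failover ideas above concrete, here is a minimal, self-contained sketch of "try the cheapest option first, fall over to the next one if it is unavailable". It is not SkyPilot's implementation or API: the `Candidate` type, the `provision_cheapest` function, the catalog entries, and all prices are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """One place a job could run. All fields are illustrative assumptions."""
    cloud: str
    region: str
    instance_type: str
    hourly_price: float  # USD per hour; made-up numbers, not real pricing


def provision_cheapest(
    candidates: Iterable[Candidate],
    try_provision: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """Try candidates from cheapest to most expensive.

    Returns the first candidate for which try_provision succeeds,
    or None if every candidate fails.
    """
    for candidate in sorted(candidates, key=lambda c: c.hourly_price):
        if try_provision(candidate):
            return candidate
        # Provisioning failed (e.g. no capacity): fall over to the next option.
    return None


# Hypothetical catalog of equivalent GPU VMs across three clouds.
CATALOG = [
    Candidate("cloud-a", "region-1", "gpu-small", 3.10),
    Candidate("cloud-b", "region-2", "gpu-small", 2.40),
    Candidate("cloud-c", "region-3", "gpu-small", 2.90),
]


def fake_provision(candidate: Candidate) -> bool:
    """Simulated provisioner: pretend cloud-b is out of capacity."""
    return candidate.cloud != "cloud-b"


chosen = provision_cheapest(CATALOG, fake_provision)
print(chosen)
```

In this simulated run the cheapest entry (cloud-b) is unavailable, so the loop falls over to the next cheapest one (cloud-c). A real system would also have to account for quotas, data egress costs, and retry policy, which this sketch leaves out.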
SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.
Contributors
Zongheng Yang, Zhanghao Wu, Romil Bhardwaj, Tian Xia, Ziming Mao, Tyler Griggs
Publications
NSDI 24 – Can’t Be Late: Optimizing Spot Instance Savings under Deadlines
NSDI 23 – SkyPilot: An Intercloud Broker for Sky Computing
HotOS 21 – From Cloud Computing to Sky Computing