An Efficient Collective Communication Library for GPUs

Existing network transports under NCCL (i.e., kernel TCP and RDMA) stream large volumes of data over one or a few network paths, which makes them prone to congestion in datacenter networks. UCCL instead sprays packets in software across the many available network paths to avoid a “single-path-of-congestion”. With this design, UCCL provides the following benefits:
- Open-source research platform for ML collectives
- Faster collectives by leveraging multiple network paths
- Wide availability in the public cloud by running on legacy NICs and Ethernet fabrics
- Evolvable transport designs including multi-path load balancing and congestion control
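
To illustrate why spraying helps, here is a toy sketch that compares the load on the busiest path when a flow is pinned to a single path versus when its packets are spread over all paths. The function names, the abstract integer path index, and the round-robin policy are assumptions made for illustration only; they are not UCCL's actual API or load-balancing algorithm.

```python
from collections import Counter


def single_path(num_packets, num_paths, flow_hash=0):
    """Pin every packet of a flow to the one path its flow hash selects."""
    path = flow_hash % num_paths
    return [path] * num_packets


def spray(num_packets, num_paths):
    """Assign packets to paths round-robin (a simple, hypothetical spraying policy)."""
    return [seq % num_paths for seq in range(num_packets)]


def max_path_load(assignment):
    """Return the number of packets carried by the busiest path."""
    return max(Counter(assignment).values())


if __name__ == "__main__":
    packets, paths = 1024, 8
    print(max_path_load(single_path(packets, paths)))  # 1024: one path carries everything
    print(max_path_load(spray(packets, paths)))        # 128: load is spread evenly
```

In this simplified model the busiest path carries 8x less traffic under spraying. A real transport would also have to handle out-of-order arrival and per-path congestion signals, which this sketch omits.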

Contributors
Yang Zhou, Zhongjie Chen, Kaichao You, Costin Raiciu, Ion Stoica