UCCL – UC Berkeley Sky Computing Lab

An Efficient Collective Communication Library for GPUs

Existing network transports under NCCL (i.e., kernel TCP and RDMA) stream huge data volumes over one or a few network paths, which makes them prone to congestion in datacenter networks. UCCL instead sprays packets in software across the many available network paths, avoiding a "single path of congestion". With this design, UCCL provides the following benefits:

Open-source research platform for ML collectives
Faster collectives by leveraging multi-path
Widely available in the public cloud by leveraging legacy NICs and Ethernet fabric
Evolvable transport designs including multi-path load balancing and congestion control
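
To illustrate the idea behind packet spraying, the toy simulation below compares pinning a whole flow to one path against spreading its packets over all paths. It is a minimal sketch and not UCCL's implementation: the function names `single_path` and `spray`, the uniform random path choice, and the packet and path counts are all assumptions made for illustration. This page does not describe how UCCL actually selects paths.

```python
import random

def single_path(num_packets, num_paths, flow_id):
    """Pin every packet of a flow to one path (hypothetical flow-hash placement)."""
    load = [0] * num_paths
    load[flow_id % num_paths] += num_packets
    return load

def spray(num_packets, num_paths, seed=0):
    """Send each packet on a path chosen uniformly at random (assumed policy)."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    load = [0] * num_paths
    for _ in range(num_packets):
        load[rng.randrange(num_paths)] += 1
    return load

# Per-path packet counts for one 10,000-packet flow over 8 paths.
pinned = single_path(10_000, 8, flow_id=3)
sprayed = spray(10_000, 8)

# The busiest path carries the whole flow when pinned,
# but only roughly 1/8 of it when sprayed.
print("busiest path, pinned: ", max(pinned))
print("busiest path, sprayed:", max(sprayed))
```

In this sketch the busiest path carries all 10,000 packets when the flow is pinned, while spraying leaves each path with roughly one eighth of the traffic, so no single path becomes the hot spot.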

GitHub

Blog Post

Twitter

Contributors

Yang Zhou, Zhongjie Chen, Ziming Mao, ChonLam Lao, Shuo Yang, Pravein Govindan Kannan, Jiaqi Gao, Yilong Zhao, Yongji Wu, Kaichao You, Fengyuan Ren, Zhiying Xu, Costin Raiciu, Ion Stoica

Publications

CoRR – An Extensible Software Transport Layer for GPU Networking