Fast Multi-Node, Multi-GPU Fused Kernels

mKernel is a small, focused library of persistent CUDA kernels, each of which fuses intra-node NVLink communication, inter-node RDMA, and dense compute into a single kernel.
- GPU-driven networking, built on libibverbs. mKernel uses GPU-initiated RDMA writes without depending on NCCL or NVSHMEM. We find that writing the communication backend from scratch helps maximize performance and accommodate heterogeneous networking devices.
- Multi-GPU + multi-node, in one kernel. Intra-node NVLink and inter-node RDMA both live inside the same persistent kernel.
- Fine-grained intra-kernel overlap. Compute and communication overlap at tile/chunk granularity, covering both the intra-node and inter-node GPU communication.
- Persistent kernel with SM specialization. CTAs self-assign roles, such as compute, intra-comm, inter-send, and inter-reduce. The split (e.g., the number of SMs dedicated to each role) is tunable per shape.
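
As a rough illustration of the SM specialization described above, the sketch below shows one way a CTA could pick its role from its block index. It is not mKernel's actual code: `Role`, `RoleSplit`, `total_ctas`, `role_for_cta`, and `persistent_kernel` are hypothetical names, and the contiguous-range mapping is an assumption made for clarity. The helper compiles as plain C++, and the kernel skeleton is only compiled under nvcc.

```cpp
#include <cassert>

// Let the role-mapping helpers compile under both a host C++ compiler and nvcc.
#ifdef __CUDACC__
#define MK_HD __host__ __device__
#else
#define MK_HD
#endif

// Hypothetical role names mirroring the roles listed above.
enum class Role { Compute, IntraComm, InterSend, InterReduce };

// Hypothetical per-shape tuning knob: how many CTAs each role receives.
struct RoleSplit {
    int compute;
    int intra_comm;
    int inter_send;
    int inter_reduce;
};

// Grid size needed to cover every role in the split.
MK_HD inline int total_ctas(const RoleSplit& s) {
    return s.compute + s.intra_comm + s.inter_send + s.inter_reduce;
}

// Map a CTA index to a role by carving the grid into contiguous ranges:
// [compute | intra-comm | inter-send | inter-reduce].
MK_HD inline Role role_for_cta(int cta, const RoleSplit& s) {
    assert(cta >= 0 && cta < total_ctas(s));
    if (cta < s.compute) return Role::Compute;
    cta -= s.compute;
    if (cta < s.intra_comm) return Role::IntraComm;
    cta -= s.intra_comm;
    if (cta < s.inter_send) return Role::InterSend;
    return Role::InterReduce;
}

#ifdef __CUDACC__
// Skeleton only: each CTA branches once on its role, then would loop over
// its own work items until the whole problem is finished.
__global__ void persistent_kernel(RoleSplit split) {
    switch (role_for_cta(static_cast<int>(blockIdx.x), split)) {
        case Role::Compute:     /* compute output tiles */               break;
        case Role::IntraComm:   /* move chunks between GPUs over NVLink */ break;
        case Role::InterSend:   /* issue RDMA writes to remote nodes */  break;
        case Role::InterReduce: /* reduce chunks received from remote */ break;
    }
}
#endif
```

Under these assumptions, the kernel would be launched with `total_ctas(split)` blocks, and tuning for a new shape amounts to changing the four counts in `RoleSplit`, for example `{100, 16, 8, 8}` (illustrative numbers only).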
Contributors
Ziming Mao, Yang Zhou, Chon Lam Lao, Costin Raiciu, Scott Shenker, Ion Stoica
Publications
CoRR – Paper Title Hyperlinked