mKernel – UC Berkeley Sky Computing Lab

Fast Multi-Node, Multi-GPU Fused Kernels

mKernel is a small, focused library of persistent CUDA kernels, each of which fuses intra-node NVLink communication, inter-node RDMA, and dense compute into a single kernel.

GPU-driven networking, built on libibverbs. mKernel uses GPU-initiated RDMA writes without depending on NCCL or NVSHMEM. We find that writing the communication backend from scratch helps maximize performance and accommodate heterogeneous networking devices.
Multi-GPU + multi-node, in one kernel. Intra-node NVLink and inter-node RDMA both live inside the same persistent kernel.
Fine-grained intra-kernel overlap. Compute and communication overlap at tile/chunk granularity, for both intra-node and inter-node GPU communication.
Persistent kernel with SM specialization. CTAs self-assign roles such as compute, intra-comm, inter-send, and inter-reduce. The split (e.g., the number of SMs dedicated to each role) is tunable per shape.
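
To make the SM specialization idea concrete, here is a minimal host-side C++ sketch of how a CTA might pick its role from its index. It is an illustration under stated assumptions, not mKernel's actual code: the names `Role`, `RoleSplit`, and `role_for_cta` are hypothetical, the contiguous-range assignment is only one possible scheme, and the split values in the usage note are made up. In a CUDA kernel, `cta` would typically be `blockIdx.x` and the total would match `gridDim.x`.

```cpp
#include <cassert>

// Hypothetical role names, mirroring the roles listed above.
enum class Role { Compute, IntraComm, InterSend, InterReduce };

// Hypothetical per-shape split: how many CTAs are dedicated to each role.
// Assumes one persistent CTA per SM, so the counts sum to the grid size.
struct RoleSplit {
    int compute;
    int intra_comm;
    int inter_send;
    int inter_reduce;

    int total() const {
        return compute + intra_comm + inter_send + inter_reduce;
    }
};

// Map a CTA index to a role by walking contiguous index ranges:
// [0, compute) -> Compute, then IntraComm, then InterSend, then InterReduce.
// Each CTA can evaluate this independently, with no coordination.
inline Role role_for_cta(int cta, const RoleSplit& s) {
    assert(cta >= 0 && cta < s.total());  // index must lie inside the grid
    if (cta < s.compute) return Role::Compute;
    cta -= s.compute;
    if (cta < s.intra_comm) return Role::IntraComm;
    cta -= s.intra_comm;
    if (cta < s.inter_send) return Role::InterSend;
    return Role::InterReduce;
}
```

For example, with an illustrative split of `RoleSplit{100, 16, 8, 8}`, CTAs 0 to 99 would run compute, 100 to 115 intra-node communication, 116 to 123 inter-node sends, and 124 to 131 inter-node reduction. Tuning per shape then amounts to choosing different counts for each problem size.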

Blog Post

GitHub

Contributors

Ziming Mao, Yang Zhou, Chon Lam Lao, Costin Raiciu, Scott Shenker, Ion Stoica

Publications

CoRR – Paper Title Hyperlinked