Extremely slow multi-node comm in k8s clusters

I'm a user of a k8s cluster and was allocated three machines. I started a head node with ray start --head and two worker nodes with ray start.

Then I submitted my distributed training script to the head node with ray job submit. The script uses Ray and DeepSpeed. Specifically, I create 24 remote actors, one per GPU, and set up 4 torch process groups containing 6, 6, 1, and 11 actors respectively. In each training step I run inference, compute the loss and gradients, do backprop, and use torch to broadcast the model weights within a process group. However, each training step takes half an hour to finish.
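In case it helps, the wiring looks roughly like the sketch below (heavily simplified, not my actual script; the actor class and method names are made up for illustration). Each actor owns one GPU, joins its torch process group over NCCL, and the step ends with a weights broadcast inside the group:

```python
import os
import time

import ray
import torch
import torch.distributed as dist


@ray.remote(num_gpus=1)
class Worker:
    def get_ip(self):
        # Node IP of this actor, used as the rendezvous address for rank 0.
        return ray.util.get_node_ip_address()

    def setup(self, master_addr, master_port, rank, world_size):
        # Each actor joins its torch process group over NCCL.
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)
        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    def broadcast_weights(self, numel):
        # Rank 0 broadcasts a dummy "weights" tensor to the rest of its group.
        # NCCL requires CUDA tensors.
        t = torch.ones(numel, device="cuda") if dist.get_rank() == 0 \
            else torch.zeros(numel, device="cuda")
        dist.broadcast(t, src=0)
        torch.cuda.synchronize()
        return float(t.mean())


ray.init(address="auto")

# One of the four groups: 6 actors, one GPU each.
workers = [Worker.remote() for _ in range(6)]
master_addr = ray.get(workers[0].get_ip.remote())
ray.get([w.setup.remote(master_addr, 29500, rank, len(workers))
         for rank, w in enumerate(workers)])

# Time a single large broadcast (~400 MB of fp32).
start = time.time()
ray.get([w.broadcast_weights.remote(100_000_000) for w in workers])
print(f"broadcast took {time.time() - start:.1f}s")
```

At 13 MB/s, moving ~400 MB would already take around 30 seconds per hop, so even one large broadcast could dominate a step.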

Note that the script works fine in a single-node setup. To debug, I ran the same script on the head node only, this time creating 8 remote actors, one per GPU, and 4 torch process groups with 2 actors each. In that configuration a training step runs very quickly, finishing in tens of seconds.

I used the Ray dashboard to inspect what's wrong. When the training step is stuck in the multi-node setting, the Sent and Received metrics are only around 13 MB/s, which seems strange for the NCCL backend (which I made sure the torch process groups are using). So I suspect the communication between nodes is actually going over plain ethernet? But I can't be sure about it. Has anyone run into similar communication problems when manually setting up a Ray cluster across k8s pods?

You should be able to validate which interface the traffic is actually going over by looking at the network interface statistics on each node.
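For example, something along these lines (a rough sketch; it assumes Linux nodes and that you can exec into the pods or run it from an actor, and the interface names in your setup will differ) samples /proc/net/dev for ten seconds and prints the per-interface throughput, so you can see whether the bytes are flowing over the pod's eth0 or over a high-speed/RDMA interface while a training step is in flight:

```python
import time


def sample_bytes():
    # /proc/net/dev keeps cumulative per-interface RX/TX byte counters (Linux).
    stats = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:  # skip the two header lines
            iface, data = line.split(":", 1)
            fields = data.split()
            stats[iface.strip()] = (int(fields[0]), int(fields[8]))  # rx_bytes, tx_bytes
    return stats


INTERVAL = 10  # seconds; run while a training step / broadcast is in progress
before = sample_bytes()
time.sleep(INTERVAL)
after = sample_bytes()

for iface in sorted(after):
    rx = (after[iface][0] - before.get(iface, (0, 0))[0]) / INTERVAL / 1e6
    tx = (after[iface][1] - before.get(iface, (0, 0))[1]) / INTERVAL / 1e6
    if rx > 0.1 or tx > 0.1:
        print(f"{iface:>12}: rx {rx:8.1f} MB/s   tx {tx:8.1f} MB/s")
```

If all the traffic shows up on the pod's default ethernet interface at roughly the 13 MB/s you saw in the dashboard, that's a strong hint NCCL fell back to its plain socket transport over the overlay network.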

You didn't say whether you're running your cluster on-prem or on a cloud provider, but if it's the latter, it may be worth checking the bandwidth limits of the instances' network interfaces, and whether special steps are required to enable the faster ones. For example, on AWS EKS, to use OS-bypass networking (EFA) your pods must all be scheduled in the same availability zone, and other limitations apply as well.
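Independent of the cloud question, NCCL itself will tell you which transport and interface it picked if you turn its logging on. These are standard NCCL environment variables; one way to get them into every actor is through Ray's runtime_env (the exact log wording varies with the NCCL version), roughly:

```python
import ray

# Standard NCCL debugging env vars, forwarded to every worker via Ray's runtime_env.
nccl_env = {
    "NCCL_DEBUG": "INFO",             # log transport / interface selection per rank
    "NCCL_DEBUG_SUBSYS": "INIT,NET",  # keep the output limited to init + networking
    # "NCCL_SOCKET_IFNAME": "eth0",   # optionally pin the socket transport to one interface
}

ray.init(address="auto", runtime_env={"env_vars": nccl_env})

# The actors' logs should then contain a line showing what NCCL chose, e.g. a
# "NET/Socket : Using [0]eth0"-style line means plain TCP over the pod network,
# while a "NET/IB : Using [0]mlx5_0"-style line means InfiniBand/RoCE is in use.
```

If you're launching with ray job submit, the same env_vars dict can instead be passed in the job's runtime environment.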