I’m a user on a k8s cluster and was allocated three machines. I started one head node with ray start --head and two worker nodes with ray start.
Then I submitted my distributed training script to the head node with ray job submit. The script uses Ray and DeepSpeed. Specifically, I created 24 remote actors, one per GPU, and 4 torch process groups containing 6, 6, 1, and 11 actors respectively. In each training step I run inference, compute the loss and gradients, do backprop, and use torch to broadcast the model weights inside a process group. However, each training step takes about half an hour to finish.
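For context, the setup looks roughly like this (a simplified sketch rather than the real script; GPUWorker, setup_group, the placeholder Linear model, and the port numbers are just illustrative of the structure described above):

```python
import os
import ray
import torch
import torch.distributed as dist

ray.init()  # connects to the existing cluster when launched via ray job submit


@ray.remote(num_gpus=1)
class GPUWorker:
    """One actor per GPU; each actor later joins exactly one torch process group."""

    def get_ip(self):
        return ray.util.get_node_ip_address()

    def setup_group(self, master_addr, master_port, rank, world_size):
        # Each sub-group gets its own rendezvous, so the 6/6/1/11 groups are
        # initialized independently, all with the NCCL backend.
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        # Placeholder model; the real script builds the DeepSpeed model here.
        self.model = torch.nn.Linear(1024, 1024).cuda()

    def broadcast_weights(self):
        # Rank 0 of the group broadcasts its weights; the other ranks
        # overwrite theirs in place (parameters are already on the GPU).
        for p in self.model.parameters():
            dist.broadcast(p.data, src=0)


# 24 actors, one per GPU across the three nodes.
workers = [GPUWorker.remote() for _ in range(24)]

# Split them into 4 torch process groups of sizes 6, 6, 1, 11.
groups, start = [], 0
for size in (6, 6, 1, 11):
    groups.append(workers[start:start + size])
    start += size

for port, group in enumerate(groups, start=29500):
    master_addr = ray.get(group[0].get_ip.remote())
    ray.get([
        w.setup_group.remote(master_addr, port, rank, len(group))
        for rank, w in enumerate(group)
    ])
```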
Note that the script works fine in a single-node setup. To debug, I ran the same script on only the head node, this time creating 8 remote actors, one per GPU, and 4 torch process groups of 2 actors each. In this setup a training step runs very quickly, finishing in tens of seconds.
I used the Ray dashboard to inspect what’s wrong. I found that while a training step is stuck in the multi-node setting, the Sent and Received throughput on the nodes is only around 13 MB/s, which seems strange for the NCCL backend (and I made sure NCCL was the backend for the torch process groups). So I suspect the inter-node communication is actually going over Ethernet, but I can’t be sure. Has anyone hit similar communication problems when manually setting up a Ray cluster across k8s pods?
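In case it’s relevant, this is the kind of minimal cross-node probe I can run to check which transport NCCL actually picks, using NCCL_DEBUG=INFO so the init log prints the chosen interface (a sketch; BandwidthProbe, the payload size, and the port are placeholders):

```python
import os
import time
import ray
import torch
import torch.distributed as dist

ray.init()  # connects to the existing cluster when launched via ray job submit


# NCCL_DEBUG=INFO makes NCCL log which transport/interface it picks at init,
# e.g. a "NET/Socket : Using [0]eth0..." line vs. a "NET/IB" line.
# NCCL_SOCKET_IFNAME could be added here the same way to force an interface.
@ray.remote(num_gpus=1, runtime_env={"env_vars": {"NCCL_DEBUG": "INFO"}})
class BandwidthProbe:
    def get_ip(self):
        return ray.util.get_node_ip_address()

    def run(self, master_addr, rank):
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = "29600"
        dist.init_process_group("nccl", rank=rank, world_size=2)
        payload = torch.zeros(128 * 1024 * 1024, device="cuda")  # ~512 MB fp32
        torch.cuda.synchronize()
        start = time.time()
        dist.broadcast(payload, src=0)  # crosses the node boundary
        torch.cuda.synchronize()
        elapsed = time.time() - start
        # Rough number only; good enough to tell ~13 MB/s apart from GB/s.
        return f"rank {rank}: ~{payload.numel() * 4 / elapsed / 1e9:.2f} GB/s"


# SPREAD scheduling so the two probes land on different nodes.
probes = [BandwidthProbe.options(scheduling_strategy="SPREAD").remote()
          for _ in range(2)]
master_addr = ray.get(probes[0].get_ip.remote())
print(ray.get([p.run.remote(master_addr, rank) for rank, p in enumerate(probes)]))
```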