Extremely slow multi-node comm in k8s clusters

I'm a user of a k8s cluster and was allocated three machines. I started a head node with ray start --head and two worker nodes with ray start.

Then I submitted my distributed training script to the head node with ray job submit. The script uses Ray and DeepSpeed. Specifically, I create 24 remote actors, one per GPU, and set up 4 torch process groups containing 6, 6, 1, and 11 actors respectively. In each training step I run inference, compute the loss and gradients, do backprop, and use torch to broadcast the model weights within a process group. However, each training step takes half an hour to finish.
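In case it helps, the wiring looks roughly like the sketch below (heavily simplified, not my actual script; the actor class and method names are made up for illustration). Each actor owns one GPU, joins its torch process group over NCCL, and the step ends with a weights broadcast inside the group:

```python
import os
import time

import ray
import torch
import torch.distributed as dist


@ray.remote(num_gpus=1)
class Worker:
    def get_ip(self):
        # Node IP of this actor, used as the rendezvous address for rank 0.
        return ray.util.get_node_ip_address()

    def setup(self, master_addr, master_port, rank, world_size):
        # Each actor joins its torch process group over NCCL.
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)
        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    def broadcast_weights(self, numel):
        # Rank 0 broadcasts a dummy "weights" tensor to the rest of its group.
        # NCCL requires CUDA tensors.
        t = torch.ones(numel, device="cuda") if dist.get_rank() == 0 \
            else torch.zeros(numel, device="cuda")
        dist.broadcast(t, src=0)
        torch.cuda.synchronize()
        return float(t.mean())


ray.init(address="auto")

# One of the four groups: 6 actors, one GPU each.
workers = [Worker.remote() for _ in range(6)]
master_addr = ray.get(workers[0].get_ip.remote())
ray.get([w.setup.remote(master_addr, 29500, rank, len(workers))
         for rank, w in enumerate(workers)])

# Time a single large broadcast (~400 MB of fp32).
start = time.time()
ray.get([w.broadcast_weights.remote(100_000_000) for w in workers])
print(f"broadcast took {time.time() - start:.1f}s")
```

At 13 MB/s, moving ~400 MB would already take around 30 seconds per hop, so even one large broadcast could dominate a step.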

Note that the script works fine in a single-node setup. To debug, I ran the same script on the head node only, this time creating 8 remote actors, one per GPU, and 4 torch process groups with 2 actors each. In that configuration a training step runs very quickly, finishing in tens of seconds.

I used the Ray dashboard to inspect what's wrong. When the training step is stuck in the multi-node setting, the Sent and Received metrics are only around 13 MB/s, which seems strange for the NCCL backend (which I made sure the torch process groups are using). So I suspect the communication between nodes is actually going over plain ethernet? But I can't be sure about it. Has anyone run into similar communication problems when manually setting up a Ray cluster across k8s pods?

You should be able to validate which interface the traffic is actually going over by looking at the network interface statistics on each node.
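For example, something along these lines (a rough sketch; it assumes Linux nodes and that you can exec into the pods or run it from an actor, and the interface names in your setup will differ) samples /proc/net/dev for ten seconds and prints the per-interface throughput, so you can see whether the bytes are flowing over the pod's eth0 or over a high-speed/RDMA interface while a training step is in flight:

```python
import time


def sample_bytes():
    # /proc/net/dev keeps cumulative per-interface RX/TX byte counters (Linux).
    stats = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:  # skip the two header lines
            iface, data = line.split(":", 1)
            fields = data.split()
            stats[iface.strip()] = (int(fields[0]), int(fields[8]))  # rx_bytes, tx_bytes
    return stats


INTERVAL = 10  # seconds; run while a training step / broadcast is in progress
before = sample_bytes()
time.sleep(INTERVAL)
after = sample_bytes()

for iface in sorted(after):
    rx = (after[iface][0] - before.get(iface, (0, 0))[0]) / INTERVAL / 1e6
    tx = (after[iface][1] - before.get(iface, (0, 0))[1]) / INTERVAL / 1e6
    if rx > 0.1 or tx > 0.1:
        print(f"{iface:>12}: rx {rx:8.1f} MB/s   tx {tx:8.1f} MB/s")
```

If all the traffic shows up on the pod's default ethernet interface at roughly the 13 MB/s you saw in the dashboard, that's a strong hint NCCL fell back to its plain socket transport over the overlay network.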

You didn't say whether you're running your cluster on-prem or on a cloud provider, but if it's the latter, it may be worth checking the bandwidth limits of the instances' network interfaces, and whether special steps are required to enable the faster ones. For example, on AWS EKS, to use OS-bypass networking (EFA) your pods must all be scheduled in the same availability zone, and other limitations apply as well.
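Independent of the cloud question, NCCL itself will tell you which transport and interface it picked if you turn its logging on. These are standard NCCL environment variables; one way to get them into every actor is through Ray's runtime_env (the exact log wording varies with the NCCL version), roughly:

```python
import ray

# Standard NCCL debugging env vars, forwarded to every worker via Ray's runtime_env.
nccl_env = {
    "NCCL_DEBUG": "INFO",             # log transport / interface selection per rank
    "NCCL_DEBUG_SUBSYS": "INIT,NET",  # keep the output limited to init + networking
    # "NCCL_SOCKET_IFNAME": "eth0",   # optionally pin the socket transport to one interface
}

ray.init(address="auto", runtime_env={"env_vars": nccl_env})

# The actors' logs should then contain a line showing what NCCL chose, e.g. a
# "NET/Socket : Using [0]eth0"-style line means plain TCP over the pod network,
# while a "NET/IB : Using [0]mlx5_0"-style line means InfiniBand/RoCE is in use.
```

If you're launching with ray job submit, the same env_vars dict can instead be passed in the job's runtime environment.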