Hi there!
I am trying to fine-tune an LLM using HuggingFace’s SFTTrainer together with Ray Train (Ray 2.22.0) in a multi-node setup (2 nodes with 2 GPUs each).
I am able to set up the Ray cluster correctly and have set ScalingConfig(num_workers=4, use_gpu=True).
Unfortunately, I get the following error before training starts (after the model has loaded):
torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error: socketPollConnect: Connect to ip_address<38715> returned 113(No route to host) errno 115(Operation now in progress)
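Following the hint in the error message, my next step is to re-run with NCCL debugging enabled. This is roughly what I plan to export on each node before launching the job (the interface name eth0 is just a placeholder; I would look up the real one with `ip addr` first):

```shell
# Enable verbose NCCL logging, as suggested in the error message above.
export NCCL_DEBUG=INFO
# Optionally pin NCCL to the network interface the nodes can actually
# reach each other on (eth0 is a placeholder, not my real interface).
export NCCL_SOCKET_IFNAME=eth0
```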
I believe the port (38715) mentioned in the last error message is closed. Is there a way to specify the port number NCCL uses?
If specifying the port is not possible, would restricting the port range be a viable workaround?
The NCCL docs (Troubleshooting — NCCL 2.22.3 documentation, nvidia.com) mention:

"NCCL opens TCP ports to connect processes together and exchange connection information. To restrict the range of ports used by NCCL, one can set the net.ipv4.ip_local_port_range property of the Linux kernel."
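If restricting the range is the way to go, I assume it would look something like this on every node (the 50000–51000 range below is only an example I made up, not something I have verified against our firewall rules):

```
# /etc/sysctl.d/99-nccl-ports.conf -- example values only;
# pick a range that is actually open in the cluster firewall
net.ipv4.ip_local_port_range = 50000 51000
```

applied with `sudo sysctl --system` (or `sysctl -w net.ipv4.ip_local_port_range="50000 51000"` for a one-off change). Note this restricts ephemeral ports system-wide, not just for NCCL.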
Thank you very much for any help!
Best regards