Ray Train with DDP on a multi-node setup

Hi there!

I am trying to fine-tune an LLM using Hugging Face's SFTTrainer together with Ray Train (Ray 2.22.0) in a multi-node setup (2 nodes with 2 GPUs each).
I am able to set up the Ray cluster correctly and I set ScalingConfig(num_workers=4, use_gpu=True).
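
Roughly, the Ray side of the script is wired up like this (a simplified sketch; the actual SFTTrainer fine-tuning code is omitted):

```python
# Simplified sketch of the setup described above; the SFTTrainer fine-tuning
# logic would live inside train_func.
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    # Ray Train has already initialized the torch.distributed process group
    # (NCCL backend, 4 workers across the 2 nodes) when this function runs.
    # ... load model/tokenizer and run the Hugging Face SFTTrainer here ...
    pass

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),  # 2 nodes x 2 GPUs
)
trainer.fit()
```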

Unfortunately, I get the following error before training can start (after the model has been loaded):

torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5

ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.

Last error: socketPollConnect: Connect to ip_address<38715> returned 113(No route to host) errno 115(Operation now in progress)

I believe the port <38715> mentioned in the last error message is closed. Is there a way to specify the port number?
In case it is not possible to specify the port, would restricting the port range be a potential solution?
The NCCL docs (Troubleshooting — NCCL 2.22.3 documentation (nvidia.com)) mention the following:

"NCCL opens TCP ports to connect processes together and exchange connection information. To restrict the range of ports used by NCCL, one can set the net.ipv4.ip_local_port_range property of the Linux kernel."

Thank you very much for any help!
Best regards

Can you try to run a simple PyTorch example (e.g. this one) without SFTTrainer? I just want to check whether TorchTrainer works in your cluster.
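
Something along these lines would be enough as a sanity check (my own rough sketch, not the linked example — it just runs a few DDP steps on a tiny model so NCCL has to communicate across both nodes):

```python
import torch
import ray.train
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    device = ray.train.torch.get_device()
    # prepare_model moves the model to the GPU and wraps it in DistributedDataParallel
    model = ray.train.torch.prepare_model(torch.nn.Linear(8, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(5):
        x = torch.randn(16, 8, device=device)
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()  # the gradient all-reduce goes over NCCL
        optimizer.step()
    print("rank", ray.train.get_context().get_world_rank(), "finished")

TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
).fit()
```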

If the above example doesn't work, then it could be an issue with the network configuration. Since I'm not an expert in networking, perhaps you could also post this in the Ray Core channel?

Thank you very much for your answer!
I was able to resolve the problem by restricting the port range as described in the NCCL docs: Troubleshooting — NCCL 2.22.3 documentation (nvidia.com).
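
For anyone who hits the same issue: you can check the current ephemeral port range with something like the snippet below (the exact range I set depends on which ports are open in our environment, so the values in the comment are just placeholders):

```python
# Reads the ephemeral port range that NCCL draws its ports from (Linux only).
# Changing it requires root, e.g.:
#   sysctl -w net.ipv4.ip_local_port_range="50000 51000"
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    low, high = f.read().split()
print(f"current local port range: {low}-{high}")
```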

IMO it would be nicer to have a way to set the ports manually via some arguments …

Thank you once again!
Best