Hi there!
I am trying to fine-tune an LLM using HuggingFace’s SFTTrainer together with Ray Train (Ray 2.22.0) in a multi-node setup (2 nodes with 2 GPUs each).
I am able to set up the Ray cluster correctly and have set ScalingConfig(num_workers=4, use_gpu=True).
Unfortunately, I get the following error before training starts (after the model has loaded):
torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error: socketPollConnect: Connect to ip_address<38715> returned 113(No route to host) errno 115(Operation now in progress)
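Following the hint in the error message, my next step is to re-run with NCCL debugging enabled. This is roughly what I plan to export on each node before launching the job (the interface name eth0 is just a placeholder; I would look up the real one with `ip addr` first):

```shell
# Enable verbose NCCL logging, as suggested in the error message above.
export NCCL_DEBUG=INFO
# Optionally pin NCCL to the network interface the nodes can actually
# reach each other on (eth0 is a placeholder, not my real interface).
export NCCL_SOCKET_IFNAME=eth0
```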
I believe the port (38715) mentioned in the last error message is closed. Is there a way to specify the port number NCCL uses?
If specifying the port is not possible, would restricting the port range be a viable workaround?
The NCCL docs (Troubleshooting — NCCL 2.22.3 documentation, nvidia.com) mention:

"NCCL opens TCP ports to connect processes together and exchange connection information. To restrict the range of ports used by NCCL, one can set the net.ipv4.ip_local_port_range property of the Linux kernel."
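If restricting the range is the way to go, I assume it would look something like this on every node (the 50000–51000 range below is only an example I made up, not something I have verified against our firewall rules):

```
# /etc/sysctl.d/99-nccl-ports.conf -- example values only;
# pick a range that is actually open in the cluster firewall
net.ipv4.ip_local_port_range = 50000 51000
```

applied with `sudo sysctl --system` (or `sysctl -w net.ipv4.ip_local_port_range="50000 51000"` for a one-off change). Note this restricts ephemeral ports system-wide, not just for NCCL.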
Thank you very much for any help!
Best regards