Initializing Ray in a multi-node environment with NCCL

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

Hi, I have a multi-node cluster that has been initialized by torchrun with an NCCL backend (I am not using Ray's Torch trainer class or functionality). I am only using ray.data.read_parquet and map_batches() to read and partition the data across all the GPUs. How do I initialize the Ray client in a way that doesn't take too many resources away from NCCL? I know Ray will try to allocate all the available CPU cores by default, but that's not what I want.
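
For reference, a minimal sketch of the kind of pipeline described above; the parquet path, the preprocess function, and the batch_size/num_gpus values are placeholders I chose for illustration, not taken from the original post:

```python
import ray

# Placeholder dataset location for illustration only.
ds = ray.data.read_parquet("s3://my-bucket/training-data/")

def preprocess(batch):
    # Per-batch work that should run on a GPU worker.
    return batch

# Requesting one GPU per map task is one way to have Ray schedule a task
# on each visible GPU instead of fanning out across every CPU core.
ds = ds.map_batches(preprocess, num_gpus=1, batch_size=1024)
```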

My thinking is: run ray start --head on node rank 0, followed by ray start --address=HEAD_NODE_IP_ADDRESS:HEAD_NODE_PORT on the other nodes, and then call ray.init(address="auto") in the actual Python code. Is this sufficient, so that when I call .map_batches() Ray will automatically know to use all the GPUs in the NCCL environment (which I have configured to use the same underlying resources as Ray)? Most importantly, I do not want Ray to spawn a thousand processes and use up all the resources for itself, which would cause the actual training to stall.
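
A minimal sketch of that flow, with the head node address kept as a placeholder; the shell commands are shown as comments above the Python driver code:

```python
# Shell, on node rank 0:
#   ray start --head
# Shell, on every other node:
#   ray start --address=HEAD_NODE_IP_ADDRESS:HEAD_NODE_PORT
#
# Then, in the training script:
import ray

# address="auto" attaches to the Ray cluster already running on this node
# instead of starting a fresh local instance.
ray.init(address="auto")

# Sanity check: the cluster should report one GPU per device that the
# torchrun/NCCL job is using on these nodes.
print(ray.cluster_resources())
```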

Can you start Ray with --num-cpus to limit the CPUs that are available to Ray?
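
A short sketch of that suggestion; the value 8 below is a placeholder, not a recommendation, and should be chosen to leave enough cores for the torchrun/NCCL processes:

```python
# Shell, run on each node before the training job starts:
#   ray start --head --num-cpus=8                                       # on node rank 0
#   ray start --address=HEAD_NODE_IP_ADDRESS:HEAD_NODE_PORT --num-cpus=8  # on the other nodes
#
# The equivalent cap for a local, single-node Ray instance started from Python:
import ray

ray.init(num_cpus=8)  # note: num_cpus is ignored when connecting to an existing cluster
print(ray.available_resources())  # verify what Ray thinks it can schedule against
```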