Initializing Ray in a multi-node environment with NCCL

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

Hi, I have a multi-node cluster that has been initialized by torchrun with an NCCL backend (I am not using Ray's Torch trainer class or functionality). I am only using ray.data.read_parquet and map_batches() to read and partition the data across all the GPUs. How do I initialize the Ray client in a way that doesn't take too many resources away from NCCL? I know Ray will try to allocate all the available CPU cores by default, but that's not what I want.
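
For reference, a minimal sketch of the kind of pipeline described above; the parquet path, the preprocess function, and the batch_size/num_gpus values are placeholders I chose for illustration, not taken from the original post:

```python
import ray

# Placeholder dataset location for illustration only.
ds = ray.data.read_parquet("s3://my-bucket/training-data/")

def preprocess(batch):
    # Per-batch work that should run on a GPU worker.
    return batch

# Requesting one GPU per map task is one way to have Ray schedule a task
# on each visible GPU instead of fanning out across every CPU core.
ds = ds.map_batches(preprocess, num_gpus=1, batch_size=1024)
```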

My thinking is: run ray start --head on node rank 0, followed by ray start --address=HEAD_NODE_IP_ADDRESS:HEAD_NODE_PORT on the other nodes, and then call ray.init(address="auto") in the actual Python code. Is this sufficient, so that when I call .map_batches() Ray will automatically know to use all the GPUs in the NCCL environment (which I have configured to use the same underlying resources as Ray)? Most importantly, I do not want Ray to spawn a thousand processes and use up all the resources for itself, which would cause the actual training to stall.
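
A minimal sketch of that flow, with the head node address kept as a placeholder; the shell commands are shown as comments above the Python driver code:

```python
# Shell, on node rank 0:
#   ray start --head
# Shell, on every other node:
#   ray start --address=HEAD_NODE_IP_ADDRESS:HEAD_NODE_PORT
#
# Then, in the training script:
import ray

# address="auto" attaches to the Ray cluster already running on this node
# instead of starting a fresh local instance.
ray.init(address="auto")

# Sanity check: the cluster should report one GPU per device that the
# torchrun/NCCL job is using on these nodes.
print(ray.cluster_resources())
```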

Can you start Ray with --num-cpus to limit the CPUs that are available to Ray?
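
A short sketch of that suggestion; the value 8 below is a placeholder, not a recommendation, and should be chosen to leave enough cores for the torchrun/NCCL processes:

```python
# Shell, run on each node before the training job starts:
#   ray start --head --num-cpus=8                                       # on node rank 0
#   ray start --address=HEAD_NODE_IP_ADDRESS:HEAD_NODE_PORT --num-cpus=8  # on the other nodes
#
# The equivalent cap for a local, single-node Ray instance started from Python:
import ray

ray.init(num_cpus=8)  # note: num_cpus is ignored when connecting to an existing cluster
print(ray.available_resources())  # verify what Ray thinks it can schedule against
```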