How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hi, I have a multi-node cluster that has been initialized by torchrun with an NCCL backend (I am not using Ray's Torch trainer class or functionality). I am only using `ray.data.read_parquet` and `map_batches()` to read and partition the data across all the GPUs. How do I initialize the Ray client in a way that doesn't take too many resources away from NCCL? I know Ray will try to allocate all the available CPU cores by default, but that's not what I want.
My thinking is: run `ray start --head` on node rank 0, followed by `ray start --address=HEAD_NODE_IP_ADDRESS:HEAD_NODE_PORT` on the other nodes, and then call `ray.init(address="auto")` in the actual Python code. Is this sufficient, so that when I call `.map_batches()` Ray will automatically know to use all the GPUs in the NCCL environment (which I have configured to share the same underlying resources as Ray)? Most importantly, I do not want Ray to spawn a thousand processes and then use up all the resources for itself, which will cause the actual training to stop. A rough sketch of what I mean is below.
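Concretely, something like this; the CPU count and port are placeholders I would tune, not values I've validated:

```python
# On node rank 0 (shell):
#   ray start --head --port=6379 --num-cpus=8
#
# On every other node (shell):
#   ray start --address=HEAD_NODE_IP_ADDRESS:6379 --num-cpus=8
#
# --num-cpus is my guess at how to stop Ray from claiming every core;
# I'd leave --num-gpus unset so Ray auto-detects the same GPUs NCCL uses.

import ray

# In the torchrun-launched training script, attach to the existing cluster
# instead of letting ray.init() start a fresh local instance:
ray.init(address="auto", ignore_reinit_error=True)

# Sanity check: should report the capped CPU count per node plus all GPUs.
print(ray.cluster_resources())
```

Is this the intended way to keep Ray's footprint bounded alongside torchrun, or is there a better-supported setup?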