I’m using Ray on a cluster with a variable number of GPUs per node. I want to run one task per node and let it consume all the GPUs on that node. I have defined a custom node resource to make sure tasks are not run in parallel on the same node. However, if I don’t set num_gpus for the remote function, Ray sets CUDA_VISIBLE_DEVICES to an empty string. So I’m forced to provide some fixed num_gpus value, which leaves nodes with more GPUs underutilized.
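For context, here is a tiny pure-Python illustration of why this is a problem (no Ray needed): an empty CUDA_VISIBLE_DEVICES means CUDA libraries see zero devices inside the task, no matter how many GPUs the node actually has.

```python
import os

# Simulate what the worker process sees when num_gpus is unset:
# Ray exports CUDA_VISIBLE_DEVICES as an empty string.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# Parse the variable the way CUDA libraries effectively do:
# an empty string yields an empty device list.
visible = [d for d in os.environ["CUDA_VISIBLE_DEVICES"].split(",") if d]
print(len(visible))  # 0 -> no GPUs usable inside the task
```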
Could someone help with either of these questions:
- Can I specify a flexible number of GPUs per task (e.g. "all GPUs on the node")?
- How can I stop Ray from overwriting CUDA_VISIBLE_DEVICES?
Thank you for your help!