How severely does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty in completing my task, but I can work around it.
Hello,
I have an on-premise Ray cluster (i.e. provider.type: local). I’d like to declare different worker node types in the cluster (e.g. “CPU-only nodes” and “GPU nodes”), so I tried adding an available_node_types section to the cluster configuration.
When I ran ray up to start the cluster, I got the following error:
The field available_node_types is not supported for on-premise clusters.
Is there a way to declare different node types on on-premise clusters?
The workaround I’m considering at the moment is to create a separate Ray cluster for each worker node type and have something external to Ray route each workload to the correct cluster, as in the sketch below.
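For concreteness, a minimal sketch of that external routing layer, assuming one cluster per node type. The head-node addresses and the needs_gpu flag are hypothetical placeholders, not anything Ray provides out of the box:

```python
import ray

# Hypothetical Ray Client addresses of the two single-node-type clusters.
CPU_CLUSTER = "ray://cpu-head.example.internal:10001"
GPU_CLUSTER = "ray://gpu-head.example.internal:10001"

def run_on_matching_cluster(fn, *args, needs_gpu=False):
    """Connect to whichever cluster fits the workload, run fn there, disconnect."""
    address = GPU_CLUSTER if needs_gpu else CPU_CLUSTER
    ray.init(address=address)
    try:
        task = ray.remote(num_gpus=1 if needs_gpu else 0)(fn)
        return ray.get(task.remote(*args))
    finally:
        ray.shutdown()
```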
We are running into a similar problem, except ours revolves around trying to launch multiple worker nodes on the same server in order to isolate resources.
Right now, some of the nodes in our cluster have 32 CPUs and 3 GPUs. If we let such a node be a single worker with 32 CPUs and 3 GPUs, then any task that needs 1 CPU and 1 GPU takes a worker process and starts a PyTorch process on one of the GPUs. As the node runs multiple trainings, it hands training tasks to any of the 32 CPUs, and each of those processes caches PyTorch memory on one of the 3 GPUs (caching is a performance feature of PyTorch). This leads to an OOM on the GPUs, since all 32 CPU processes end up reserving their own memory on the 3 GPUs.
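To make the failure mode concrete, here is a stripped-down sketch of the pattern; the resource requests, model, and data are placeholder assumptions, not our real training code:

```python
import ray
import torch

ray.init(address="auto")  # join the existing cluster

# Each task runs in a Ray worker process. Any process that touches CUDA keeps
# its own PyTorch caching allocator, so "freed" GPU memory stays reserved for
# that process as long as it is alive.
@ray.remote(num_cpus=1, num_gpus=1)
def train_once(n_samples):
    model = torch.nn.Linear(512, 512).cuda()   # placeholder model
    x = torch.randn(n_samples, 512).cuda()     # placeholder batch
    loss = model(x).sum()
    loss.backward()
    return float(loss)

# Over many submissions, Ray can hand tasks to many different worker processes
# (up to one per CPU), and each of them ends up holding cached CUDA memory on
# one of the 3 GPUs, which is what eventually OOMs them.
losses = ray.get([train_once.remote(64) for _ in range(100)])
```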
One way around this is to set PYTORCH_NO_CUDA_MEMORY_CACHING=1, but that causes severe performance degradation, up to 50% in our case, since the flag is really only intended for debugging.
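If anyone wants to try that flag anyway, one way to apply it (assuming pushing it through runtime_env env_vars is acceptable) is to set it for every worker process of a job:

```python
import ray

# Disable PyTorch's CUDA caching allocator in every worker process of this
# job, at the cost of the performance hit described above.
ray.init(
    address="auto",
    runtime_env={"env_vars": {"PYTORCH_NO_CUDA_MEMORY_CACHING": "1"}},
)
```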
So the final solution seems to be to isolate resources by running two workers on a single node: a “trainer” worker with 3 CPUs and 3 GPUs, and a “cpu” worker with 29 CPUs, so that single-CPU tasks can still be highly parallelized.
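A sketch of how tasks could then be pinned to the right logical worker. It assumes each ray start on the node also advertises a hypothetical custom resource: “trainer” (3 units) on the 3-CPU/3-GPU worker and “cpu_pool” (29 units) on the 29-CPU worker; those names and counts are illustrative assumptions:

```python
import ray

ray.init(address="auto")

# Training tasks are confined to the 3-CPU / 3-GPU "trainer" worker, so at
# most three processes can be caching CUDA memory at any one time.
@ray.remote(num_cpus=1, num_gpus=1, resources={"trainer": 1})
def train(shard_id):
    return f"trained shard {shard_id}"        # placeholder training body

# Pure CPU work targets the 29-CPU "cpu_pool" worker and never touches a GPU.
@ray.remote(num_cpus=1, resources={"cpu_pool": 1})
def preprocess(item_id):
    return f"preprocessed item {item_id}"     # placeholder CPU-only body

ray.get([preprocess.remote(i) for i in range(29)] +
        [train.remote(j) for j in range(3)])
```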