How severely does this issue affect your experience of using Ray?
- Low: It annoys or frustrates me for a moment.
Context: I have some tasks that require a GPU node, which is started on demand using KubeRay. Once such a task finishes, another task that doesn't require a GPU starts, and it gets scheduled to the currently active GPU node. Ideally, I would like this second task to spin up a high-memory non-GPU node, allowing the GPU node to be deallocated. However, Ray prefers to schedule the task onto the already available GPU node before it spins down.
Is there a way to define an affinity for tasks that prevents them from being scheduled to GPU nodes? I know it's possible to pin a task to a specific `node_id`, but that is not what I need, since the `node_id` is not known a priori, as the nodes scale up and down automatically.
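For reference, the `node_id` approach I mean is something like this (a minimal sketch; `cpu_task` is just a placeholder name):

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init()

@ray.remote
def cpu_task():
    pass

# Pinning a task to a node requires that node's ID, which only exists once
# the node is already up, so this doesn't help when nodes autoscale from zero.
node_id = ray.get_runtime_context().get_node_id()  # here: the driver's own node
ref = cpu_task.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)
).remote()
ray.get(ref)
```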
Why: The main reason to do this is to save compute resources/money.
@Sam_Chan thanks and sorry for the delay in replying!
I’ve read through the Placement Groups documentation, but it still isn’t clear to me how to create a placement group that guarantees tasks are scheduled only to non-GPU nodes.
For example, if there is a physical machine with 8 cores and 1 GPU, and I request a PG with `{"cpu": 8, "gpu": 0}`, what prevents it from being allocated to that GPU VM?
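For concreteness, this is roughly what I'm trying (a minimal sketch; `cpu_task` is a hypothetical task):

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# A CPU-only bundle; nothing here says "no GPUs", which is my concern:
# an 8-core GPU machine seems to satisfy this bundle just as well.
pg = placement_group([{"CPU": 8}], strategy="STRICT_PACK")
ray.get(pg.ready())

@ray.remote(num_cpus=8)
def cpu_task():
    pass

# Run the task inside the placement group.
ref = cpu_task.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
).remote()
ray.get(ref)
```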
I believe one solution is to create a virtual resource that is only available on non-GPU machines, and then use it as a requirement (as suggested in Resources). But that requires me to change the cluster configuration, which is not ideal.
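Concretely, the custom-resource idea would look something like this on the task side (a sketch, assuming a custom resource named `no_gpu` is advertised by the non-GPU worker group in the cluster config):

```python
import ray

ray.init()

# This task can only run on nodes that advertise the custom "no_gpu"
# resource, which should keep it off the GPU nodes.
@ray.remote(num_cpus=8, resources={"no_gpu": 1})
def high_memory_task():
    pass

ray.get(high_memory_task.remote())
```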
Please let me know if I am missing something, thanks!
I just realized that adding a custom resource, e.g. `no_gpu: 1`, to all non-GPU nodes would not work if those nodes are not already active. In my case, I keep the `min_replicas` of the node pool at 0, so I don’t see how Ray would know to spin up this pool to provide access to the custom resource.
Oh I see, in that case it’d get placed onto the GPU machine, which is not what you want. Can you submit a feature request on GitHub to support this? This is an interesting use case.
Is there a reason you can’t use a separate cluster to get that isolation between CPU and GPU resources?
Sorry for the late reply again, for some reason I’m not getting these notifications.
Thanks, I will create a feature request on GitHub.
Thanks for your suggestion. Although I could create a new cluster as you proposed, this would waste resources, as I’d need a new control plane just to get that CPU/GPU node isolation.