How severely does this issue affect your experience of using Ray?
- Low: It annoys or frustrates me for a moment.
Context: I have some tasks that require a GPU node, which is started on demand using KubeRay. Once such a task finishes, another task that doesn't require a GPU starts, and it gets scheduled to the currently active GPU node. Ideally, I would like this second task to spin up a high-memory non-GPU node, allowing the GPU node to be deallocated. However, Ray prefers to schedule the task onto the already available GPU node before it spins down.
Is there a way to define an affinity for tasks that prevents them from being scheduled to GPU nodes? I know it's possible to pin a task to a specific `node_id`, but that is not what I need, since the `node_id` is not known a priori, as the nodes scale up and down automatically.
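For reference, the `node_id` approach I mean is something like this (a minimal sketch; `cpu_task` is just a placeholder name):

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init()

@ray.remote
def cpu_task():
    pass

# Pinning a task to a node requires that node's ID, which only exists once
# the node is already up, so this doesn't help when nodes autoscale from zero.
node_id = ray.get_runtime_context().get_node_id()  # here: the driver's own node
ref = cpu_task.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)
).remote()
ray.get(ref)
```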
Why: The main reason to do this is to save compute resources/money.
@Sam_Chan thanks and sorry for the delay in replying!
I’ve read through the Placement Groups documentation, but it still isn’t clear to me how to create a placement group that guarantees tasks are scheduled only to non-GPU nodes.
For example, if there is a physical machine with 8 cores and 1 GPU, and I request a PG with `{"cpu": 8, "gpu": 0}`, what prevents it from being allocated to that GPU VM?
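For concreteness, this is roughly what I'm trying (a minimal sketch; `cpu_task` is a hypothetical task):

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# A CPU-only bundle; nothing here says "no GPUs", which is my concern:
# an 8-core GPU machine seems to satisfy this bundle just as well.
pg = placement_group([{"CPU": 8}], strategy="STRICT_PACK")
ray.get(pg.ready())

@ray.remote(num_cpus=8)
def cpu_task():
    pass

# Run the task inside the placement group.
ref = cpu_task.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
).remote()
ray.get(ref)
```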
I believe one solution is to create a virtual resource that is only available on non-GPU machines, and then use it as a requirement (as suggested in Resources). But that requires me to change the cluster configuration, which is not ideal.
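Concretely, the custom-resource idea would look something like this on the task side (a sketch, assuming a custom resource named `no_gpu` is advertised by the non-GPU worker group in the cluster config):

```python
import ray

ray.init()

# This task can only run on nodes that advertise the custom "no_gpu"
# resource, which should keep it off the GPU nodes.
@ray.remote(num_cpus=8, resources={"no_gpu": 1})
def high_memory_task():
    pass

ray.get(high_memory_task.remote())
```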
Please let me know if I am missing something, thanks!
I just realized that adding a custom resource, e.g. `no_gpu: 1`, to all non-GPU nodes would not work if those nodes are not already active. In my case, I keep the `min_replicas` of the node pool at 0, so I don’t see how Ray would know to spin up this pool to provide access to the custom resource.
Oh I see, in that case it’d get placed onto the GPU machine, which is not what you want. Can you submit a feature request on GitHub to support this? This is an interesting use case.
Is there a reason you can’t use a separate cluster to get that isolation between CPU and GPU resources?
Sorry for the late reply again, for some reason I’m not getting these notifications.
Thanks, I will create a feature request on GitHub.
Thanks for your suggestion. Although I could create a new cluster as you proposed, this would waste resources, as I’d need a new control plane just to get that CPU/GPU node isolation.