How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
We are running Ray on Kubernetes and using Ray Core (remote tasks). Our application uses remote tasks for both light and heavy workloads. The challenge is to limit the amount of resources used by the heavy workloads so that there is some room left for the light ones.
From reading the documentation, it seems that num_cpus could help. However, we have deeply nested tasks (simplified example below):
import ray

@ray.remote
def f(i, data):
    ...

@ray.remote
def g():
    ...

@ray.remote
def h():
    data_to_process = g.remote()
    tasks = [f.remote(i, data_to_process) for i in range(20)]
    ...
My understanding is that setting num_cpus=N on f, g, or h (or all of them) won't prevent f + g + h together from using more than N CPUs. Is this assumption correct?
Is there currently a way to make sure that f + g + h together won't use more than N CPUs for a given call to h?
I saw some discussions about task priorities, and I am wondering if the work on Workflows could solve this kind of issue (e.g. allocate a given amount of resources to a workflow run, and/or limit the number of tasks that can run in parallel within a workflow run).
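To illustrate what "limit the number of tasks that can run in parallel" could look like today, here is a rough sketch that throttles submissions with ray.wait (process_file, run_batch, and max_in_flight are just illustrative names, not part of our code). It only caps how many top-level tasks are in flight at once; it does not bound the resources used by tasks they spawn internally.

import ray

ray.init()

@ray.remote
def process_file(path):
    # stand-in for one unit of the heavy batch workload
    ...

def run_batch(paths, max_in_flight=10):
    in_flight, results = [], []
    for path in paths:
        if len(in_flight) >= max_in_flight:
            # block until at least one task finishes before submitting more
            done, in_flight = ray.wait(in_flight, num_returns=1)
            results.extend(ray.get(done))
        in_flight.append(process_file.remote(path))
    results.extend(ray.get(in_flight))
    return results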
num_cpus is used for scheduling and does not provide resource guarantees or isolation. It is possible for a single task to use more CPU than it specified for scheduling, although each task runs in a separate process and typically won't consume more than a single CPU.
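To make that concrete, a minimal sketch (heavy_step is an illustrative name): num_cpus only tells the scheduler how many CPU slots to reserve before placing the task; nothing pins or throttles the worker process afterwards, and each nested .remote() call makes its own independent request.

import ray

@ray.remote(num_cpus=2)
def heavy_step(chunk):
    # The scheduler reserves 2 CPU slots on some node before running this task,
    # but the worker process itself is not pinned or throttled to 2 cores
    # (e.g. a multi-threaded library inside it could use more).
    ...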
I forgot to mention that we are using the Ray autoscaler, with a maximum of 30 workers.
In the example you provided, the concurrency limit works because the Ray node is static. With the autoscaler, additional workers will be launched until we reach the limit.
To illustrate with our use case:
We have a heavy batch load that must process 1000 files. The cluster autoscales up to the defined limit (30 workers).
We have lightweight tasks that should be able to run in parallel (they only need 1 worker).
Since all workers are busy with the batch, the lightweight workload has to wait for the batch process to finish.
So the question is whether there is a way to limit the batch process (which might be several tasks) to just N workers (or X total memory/CPU) instead of letting it consume as many resources as it can.
Where I am not entirely clear is how this is going to behave with the Ray autoscaler. Will the placement group wait for N workers to be running, so that the X resources are satisfied, before starting to process the tasks?
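To make the question concrete, this is roughly the pattern I have in mind with placement groups (the bundle sizes, process_file, and the file list are placeholders): reserve a fixed budget of resources for the batch run and schedule all of its tasks into that reservation.

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Reserve a fixed budget for the whole batch run, e.g. 8 CPUs spread across nodes.
pg = placement_group([{"CPU": 1}] * 8, strategy="SPREAD")

# With the autoscaler, is this the call that blocks until enough workers
# have been launched to satisfy the reservation?
ray.get(pg.ready())

@ray.remote(num_cpus=1)
def process_file(path):
    ...

paths = [f"file_{i}.csv" for i in range(1000)]  # placeholder file list
refs = [
    process_file.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote(p)
    for p in paths
]
results = ray.get(refs)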