How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
We are running Ray on Kubernetes and using Ray Core (remote tasks). Our application uses remote tasks for both light and heavy workloads. The challenge is to limit the amount of resources used by the heavy workloads so that there is some room left for the light ones.
From reading the documentation, it seems that num_cpus could help. However, we have deeply nested tasks (simplified example below):
import ray

@ray.remote
def f(i, data):
    ...

@ray.remote
def g():
    ...

@ray.remote
def h():
    data_to_process = g.remote()
    tasks = [f.remote(i, data_to_process) for i in range(20)]
    ...
My understanding is that setting num_cpus=N on f, g, or h (or all of them) won't prevent f + g + h together from using more than N CPUs. Is this assumption correct?
Is there currently a way to make sure that f + g + h together won't use more than N CPUs for a given call to h?
I saw some discussions about task priorities, and I am wondering if the work on Workflows could solve this kind of issue (e.g. allocate a given amount of resources to a workflow run, and/or limit the number of tasks that can run in parallel within a workflow run).
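To illustrate what "limit the number of tasks that can run in parallel" could look like today, here is a rough sketch that throttles submissions with ray.wait (process_file, run_batch, and max_in_flight are just illustrative names, not part of our code). It only caps how many top-level tasks are in flight at once; it does not bound the resources used by tasks they spawn internally.

import ray

ray.init()

@ray.remote
def process_file(path):
    # stand-in for one unit of the heavy batch workload
    ...

def run_batch(paths, max_in_flight=10):
    in_flight, results = [], []
    for path in paths:
        if len(in_flight) >= max_in_flight:
            # block until at least one task finishes before submitting more
            done, in_flight = ray.wait(in_flight, num_returns=1)
            results.extend(ray.get(done))
        in_flight.append(process_file.remote(path))
    results.extend(ray.get(in_flight))
    return results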
num_cpus is used for scheduling and does not provide resource guarantees or isolation. It is possible for a single task to use more CPU than it specified for scheduling, although each task runs in a separate process and typically won't consume more than a single CPU.
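To make that concrete, a minimal sketch (heavy_step is an illustrative name): num_cpus only tells the scheduler how many CPU slots to reserve before placing the task; nothing pins or throttles the worker process afterwards, and each nested .remote() call makes its own independent request.

import ray

@ray.remote(num_cpus=2)
def heavy_step(chunk):
    # The scheduler reserves 2 CPU slots on some node before running this task,
    # but the worker process itself is not pinned or throttled to 2 cores
    # (e.g. a multi-threaded library inside it could use more).
    ...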
I forgot to mention that we are using the Ray autoscaler, with a maximum of 30 workers.
In the example you provided, the concurrency limit works because the Ray node is static. With the autoscaler, additional workers will be launched until we reach the limit.
To illustrate with our use case:
We have a heavy batch load that must process 1000 files. The cluster autoscales up to the defined limit (30 workers).
We have lightweight tasks that should be able to run in parallel (they only need 1 worker).
Since all workers are busy with the batch, the lightweight workload has to wait for the batch process to finish.
So the question is whether there is a way to limit the batch process (which might be several tasks) to just N workers (or X total memory/CPU) instead of letting it consume as many resources as it can.
Where I am not entirely clear is how this is going to behave with the Ray autoscaler. Will the placement group wait for N workers to be running, so that the X resources are satisfied, before starting to process the tasks?
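To make the question concrete, this is roughly the pattern I have in mind with placement groups (the bundle sizes, process_file, and the file list are placeholders): reserve a fixed budget of resources for the batch run and schedule all of its tasks into that reservation.

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Reserve a fixed budget for the whole batch run, e.g. 8 CPUs spread across nodes.
pg = placement_group([{"CPU": 1}] * 8, strategy="SPREAD")

# With the autoscaler, is this the call that blocks until enough workers
# have been launched to satisfy the reservation?
ray.get(pg.ready())

@ray.remote(num_cpus=1)
def process_file(path):
    ...

paths = [f"file_{i}.csv" for i in range(1000)]  # placeholder file list
refs = [
    process_file.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote(p)
    for p in paths
]
results = ray.get(refs)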