Understanding task and worker allocation

Hi,
The documentation (Using Actors — Ray v1.6.0) says:

Tasks: When Ray starts on a machine, a number of Ray workers will be started automatically (1 per CPU by default). They will be used to execute tasks (like a process pool). If you execute 8 tasks with num_cpus=2, and total number of CPUs is 16 (ray.cluster_resources()["CPU"] == 16), you will end up with 8 of your 16 workers idling.

I thought there would be no workers idling, since each task acquires 2 CPUs, so all CPUs/workers would be utilized?
Or does it mean that all 8 tasks run on the same 2 CPUs, so that 14 CPUs/workers are idling?

Hi @rabraham

That scenario is saying that Ray starts with a total capacity of 16 CPUs and with 16 workers that are idle but ready to do work (think of them as execution engine processes).

Now you request 8 tasks and tell Ray that each task requires 2 CPUs. Ray starts the 8 tasks on 8 of the idle workers, but since 8 * 2 = 16, those 8 workers account for all of the available CPU resources, even though 8 more workers are still idle. If you tried to start a 9th task, it would have to wait until one of the first 8 finishes. Only then would there be enough free CPUs to start the next task on a worker.
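To make the arithmetic above concrete, here is a minimal pure-Python sketch of the bookkeeping. It is only a toy model and not Ray's actual scheduler: the function name `schedule` and the greedy, in-order admission rule are assumptions made for illustration.

```python
def schedule(total_cpus, num_workers, cpu_requests):
    """Toy model of logical CPU accounting (hypothetical, not Ray's scheduler).

    Tasks are admitted in order while both the CPU budget and an idle
    worker are available; everything else has to wait.
    """
    free_cpus = total_cpus
    running, pending = [], []
    for task_id, need in enumerate(cpu_requests):
        if need <= free_cpus and len(running) < num_workers:
            free_cpus -= need          # reserve the logical CPUs
            running.append(task_id)    # one worker process runs the task
        else:
            pending.append(task_id)    # waits until a running task finishes
    return {
        "running": running,
        "pending": pending,
        "idle_workers": num_workers - len(running),
        "free_cpus": free_cpus,
    }


# 16 CPUs, 16 workers, 9 tasks that each declare num_cpus=2
result = schedule(total_cpus=16, num_workers=16, cpu_requests=[2] * 9)
print(result)
```

With these inputs the first 8 tasks exhaust the 16-CPU budget, the 9th task is pending, and 8 workers stay idle, which matches the scenario in the docs.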

One other thing to keep in mind is that Ray has no way of enforcing resource utilization; the numbers are strictly for budgeting and orchestration. If you say each task uses at most 2 CPUs but in reality it uses 4, Ray will not know or do anything about that.

Thank you @mannyv,

A few questions:

  • A single task invocation takes only one CPU. Taking a fresh example (no num_cpus value): if I have 16 CPUs and I invoke a task t1 16 times, will all 16 CPUs be used?

  • If each task execution runs on a single Ray worker on a single CPU, when does it make sense to give it num_cpus > 1? For example, if I invoke a task with num_cpus=3, my understanding is that Ray will reserve 3 CPUs for that invocation, but the example above leads me to believe that only 1 CPU will be used and 2 CPUs will remain idle. In what scenario would one want that?

It depends. Say t1 needs 1 minute to finish and you call t1.remote() 16 times up front: then all 16 CPUs will be used. If instead the calls run one by one (each finishing before the next is submitted), only one CPU will be used at a time.

It still makes sense. Ray doesn’t bind workers to specific CPUs; resources are just numbers used for accounting. For example, suppose you have 16 CPUs:

ray.get([t1.options(num_cpus=1).remote() for _ in range(16)])  # all 16 t1 calls run in parallel
ray.get([t1.options(num_cpus=4).remote() for _ in range(16)])  # at most 4 t1 calls run in parallel (16 CPUs / 4 per task)
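The numbers in those two comments follow from simple division. Here is a small sketch of that calculation; the helper names `max_parallel` and `rounds` are made up for illustration, and the "rounds" figure assumes whole-number num_cpus values and tasks that all take about the same time.

```python
import math


def max_parallel(total_cpus, num_cpus_per_task):
    """How many tasks fit into the logical CPU budget at once."""
    return total_cpus // num_cpus_per_task


def rounds(num_tasks, total_cpus, num_cpus_per_task):
    """Rough number of back-to-back batches needed to run all tasks."""
    slots = max_parallel(total_cpus, num_cpus_per_task)
    if slots == 0:
        raise ValueError("a single task asks for more CPUs than exist")
    return math.ceil(num_tasks / slots)


print(max_parallel(16, 1), rounds(16, 16, 1))  # num_cpus=1 case
print(max_parallel(16, 4), rounds(16, 16, 4))  # num_cpus=4 case
```

So with num_cpus=4 the 16 calls would run roughly as 4 batches of 4, while with num_cpus=1 they all fit in a single batch.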

Inside t1 you can do anything you like; for example, you can create 4 subprocesses and do some work in each of them.
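As a rough illustration of that last point, the body of t1 might look something like the sketch below. It is plain Python so it runs without a cluster; in Ray you would presumably decorate it with @ray.remote(num_cpus=4) so that the reservation matches what the function really uses. The squaring workload and the n_procs parameter are invented for the example.

```python
import subprocess
import sys


# In Ray this would be declared with: @ray.remote(num_cpus=4)
def t1(n_procs=4):
    """Start n_procs child Python processes and collect their results."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", f"print({i} * {i})"],  # toy workload
            stdout=subprocess.PIPE,
            text=True,
        )
        for i in range(n_procs)
    ]
    # All children are already running concurrently; now wait for each one.
    return [int(p.communicate()[0]) for p in procs]


if __name__ == "__main__":
    print(t1())
```

Since Ray only does accounting, asking for num_cpus=4 here is just a way of telling the scheduler that one t1 call keeps about 4 cores busy.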
