[Core] Ray start more workers than cpus

crystalww · April 7, 2021, 12:04pm

Ray 1.0.0
ray start --num-cpus=17 ...
When I run my code, It always create more workers and exhaust all memory.

sangcho · April 7, 2021, 8:43pm

Ray can create more workers than num_cpus in some scenarios;

When you call ray.get, ray.wait, or .remote function inside your task or actor. This is to avoid deadlock. Imagine a scenario where you have 2 cpus, 2 tasks, and one of task needs to submit another task that requires 1 cpu. If you don’t start a new worker, this task will hang forever. This is called “borrowing resources”.
When your worker owns an object. This is an internal detail, but if your worker creates nested tasks, the worker owns the object from that time. Those workers won’t be killed (more details will be in Ray 1.0 Architecture - Google Docs).

sangcho · April 7, 2021, 8:44pm

Also note that Ray dashboard shows the real number of cpus instead of num_cpus you provided through ray start. (This is something we should fix).

crystalww · April 8, 2021, 1:59am

I’m sure I didn’t create any nested tasks or actors. Now assume I have 300 actors to create, 10 Nodes (each node has 15 cpus), I create 150 actors at first and then create a actors as long as one finished. What I expected is that each node will use 15 workers and using 100% cpu, but the actual is that some nodes create more workers, each worker using 45%~60% cpu.

Note that I started ray with --num-cpus and didn’t set num_cpus in ray.remote, or create any nested tasks/actors.

sangcho · April 8, 2021, 6:25pm

Ah, in that case, the default num_cpus for actors are 0. It will be probably resolved if you set ray.remote(num_cpus=1) for actor decorators.

marrocksd · April 10, 2021, 7:26am

Something that I experienced:

I can start more workers than cores ( assume that each has 1 task running), but if I do, the exceeding ones will be blocked. Tested with direct style, I don’t know how it will behave in the nested style
If the task is async, I can start thousands without blocking.
An actor will be counted as a worker, but if I remote many parallel works to that actor, it’s still counted as 1 worker desplite using many cores. So I don’t think there is any strong constraint between workers and cores

Topic		Replies	Views
Too many actors cause worker is done Ray Core	1	34	July 8, 2025
Understanding task and worker allocation Ray Core	3	1535	September 21, 2021
Specify the number of worker processes Ray Core	10	16496	May 12, 2021
Creating actors when their amount is more than `num_cpus` Ray Core	8	4322	April 29, 2021
Best way to config ray workers Ray Core	6	451	February 26, 2021

[Core] Ray start more workers than cpus

Related topics