Creating actors when their number exceeds `num_cpus`

Hi guys,

I would like to understand the behavior in the two cases below. Is the behavior I am seeing expected?

The first:

import ray
ray.init(num_cpus=2)

@ray.remote(num_cpus=1)
class Foo():
    pass

o1 = Foo.remote()
o2 = Foo.remote()
o3 = Foo.remote()
o4 = Foo.remote()
2021-04-21 02:42:09,463 WARNING worker.py:1091 -- The actor or task with ID ffffffffffffffff7bbd902801000000 is pending and cannot currently be scheduled. It requires {CPU: 1.000000} for execution and {CPU: 1.000000} for placement, but this node only has remaining {node:10.241.129.69: 1.000000}, {object_store_memory: 128.515625 GiB}, {memory: 3669.873047 GiB}. In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

Why does the warning appear after creating the fourth actor, but not after the third one?

The second:

import ray
ray.init(num_cpus=2)

@ray.remote # no `num_cpus` specified
class Foo():
    pass

o1 = Foo.remote()
o2 = Foo.remote()
o3 = Foo.remote()
o4 = Foo.remote()
o5 = Foo.remote()
o6 = Foo.remote()
2021-04-21 02:46:53,974        WARNING worker.py:1091 -- WARNING: 6 PYTHON workers have been started. This could be a result of using a large number of actors, or it could be a consequence of using nested tasks (see https://github.com/ray-project/ray/issues/3644) for some a discussion of workarounds.
o7 = Foo.remote()
2021-04-21 02:51:04,418        WARNING worker.py:1091 -- WARNING: 7 PYTHON workers have been started. This could be a result of using a large number of actors, or it could be a consequence of using nested tasks (see https://github.com/ray-project/ray/issues/3644) for some a discussion of workarounds.

Why do the warnings appear after creating the sixth and seventh actors, but not after the third and fourth ones? Where will the extra actors (o3, o4, o5, o6, o7) be placed? Will each one take 1 core?

My machine has 112 cores.

Thanks in advance!

Why do the warnings appear after creating the sixth and seventh actors, but not after the third and fourth ones? Where will the extra actors (o3, o4, o5, o6, o7) be placed? Will each one take 1 core?

I think this warning is actually only produced once the number of workers exceeds some factor of `num_cpus`.
By default, these actors will take 0 cores. The warning is largely harmless.
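
You can double-check that default actors hold no CPUs for their lifetime by looking at Ray's own accounting with ray.available_resources(). A minimal sketch (not from the original snippet; the count may dip briefly while the actors are being placed):

import ray

ray.init(num_cpus=2)

@ray.remote  # no num_cpus specified: the actor reserves 0 CPUs while it is alive
class Foo:
    pass

actors = [Foo.remote() for _ in range(6)]
print(ray.available_resources().get("CPU"))  # expected to stay close to 2.0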

cc @Alex who will know more about this.

For your first example, it looks like it just takes some time for the warning to be raised:

In [1]: import ray
   ...: ray.init(num_cpus=2)
   ...:
   ...: @ray.remote(num_cpus=1)
   ...: class Foo():
   ...:     pass
   ...:
   ...: o1 = Foo.remote()
   ...: o2 = Foo.remote()
   ...: o3 = Foo.remote()
2021-04-21 17:45:41,083	INFO services.py:1264 -- View the Ray dashboard at http://127.0.0.1:8265

In [2]: 2021-04-21 17:46:01,329	WARNING worker.py:1086 -- The actor or task with ID ffffffffffffffffcd86a500eaecc23cced91d0101000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {0.000000/2.000000 CPU, 27.478926 GiB/27.478926 GiB memory, 13.739463 GiB/13.739463 GiB object_store_memory, 1.000000/1.000000 node:192.168.1.115}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

Which factor are you talking about?
As for the warning, I am okay with it. I am more interested in these questions:

  1. Where will the extra actors (o3, o4, o5, o6, o7) be placed?
  2. Will each one take 1 core?
  3. Will Ray take more CPUs than the initially specified 2?

Ray doesn’t have resource isolation, so the extra actors will be placed on the same node (if they are placed at all), and Ray may use more CPUs than the initially specified 2.

ray.init(num_cpus=X) is really just used for accounting purposes, not resource isolation.
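
To make the "accounting, not isolation" point concrete, here is a minimal sketch (Burner is a placeholder name, not from this thread): actors that declare no CPUs can still run CPU-heavy work concurrently, and Ray's counters never change, because Ray only tracks the numbers you declare.

import ray

ray.init(num_cpus=2)

@ray.remote  # no num_cpus: Ray's CPU counter is not decremented while this actor runs
class Burner:
    def spin(self):
        # Nothing pins or limits this process; Ray only does bookkeeping.
        return sum(i * i for i in range(10_000_000))

burners = [Burner.remote() for _ in range(4)]
ray.get([b.spin.remote() for b in burners])  # all four actors can burn real cores at once
print(ray.cluster_resources()["CPU"])        # still reports 2.0: accounting only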


@rliaw Is it wrong to expect that e.g. specifying 16 total CPUs and 1 GPU in ray.init and setting resources_per_trial to {'cpu': 4, 'gpu': 1} would result in only 1 trial being run at a time?
Surprisingly, for me it can even result in fractional GPU usage: Resources requested: 2.0/6 CPUs, 0.25/1 GPUs, 0.0/246.97 GiB heap, 0.0/75.78 GiB objects (0/1.0 accelerator_type:V100)
This then results in multiple Trainable.train_buffered() workers being instantiated, often causing CUDA out-of-memory errors.
How can one force Ray Tune to strictly adhere to the resources_per_trial argument?
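
For reference, the setup in question is roughly the following (a minimal sketch with a placeholder my_trainable, not my actual training code):

import ray
from ray import tune

ray.init(num_cpus=16, num_gpus=1)

def my_trainable(config):
    tune.report(score=config["x"] ** 2)

tune.run(
    my_trainable,
    config={"x": tune.grid_search([1, 2, 3, 4])},
    resources_per_trial={"cpu": 4, "gpu": 1},  # expectation: at most one trial runs at a time
)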

Thanks!

Hey @FarzanT that looks like a bug.

Could you help file an issue on Github?

@rliaw I think this issue arises when I resume trials; I thought that only changes to hyper-parameters are ignored, but apparently changes to the per-trial resource allocation are also ignored? How can we distinguish the two? It would be beneficial and less confusing if newly specified resource allocations were adhered to.

I think everything is ignored when resuming trials right now.

We could explore some alternative behavior here – but maybe let’s do that on github?
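
To illustrate the current behavior, a minimal sketch assuming the Ray 1.x tune.run API (my_trainable and the experiment name are placeholders):

from ray import tune

def my_trainable(config):
    tune.report(score=0)

# Original run: each trial requests 4 CPUs and 1 GPU.
tune.run(my_trainable, name="my_experiment",
         resources_per_trial={"cpu": 4, "gpu": 1})

# Resumed run: trial settings are restored from the experiment checkpoint,
# so the new resources_per_trial below is currently ignored.
tune.run(my_trainable, name="my_experiment", resume=True,
         resources_per_trial={"cpu": 8, "gpu": 1})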
