With enough Available Resources, Most of the Actors' Creation is Pending

Visual details are in the GitHub issue: [Bug] With enough Available Resources, Most of the Actors' Creation is Pending · Issue #20891 · ray-project/ray


The Problem:

Even with enough available resources, Actor number 13 always stays pending, regardless of the changes I make. After that, most of the subsequent Actors remain pending as well.


Resources

Resources Status during Actors Creation (12 Logical CPUs / 6 Physical Cores, 32GB RAM)


Versions / Dependencies

OS Windows 10, 64-bit

Ray version 1.9.0

Redis version 4.0.2


Reproduction script

  import time
  import ray
  
  # Each actor requests only a tiny fraction of a CPU.
  @ray.remote(num_cpus=0.01)  # tried different values
  class Actor:
      def __init__(self):
          pass
  
  if __name__ == '__main__':
      try:
          ray.init(num_cpus=12)  # tried different values
          time.sleep(30)  # give Ray time to finish starting up
          for i in range(50):
              # Named, detached actors "1" .. "50", created one at a time.
              Actor.options(name=str(i + 1), lifetime="detached").remote()
              time.sleep(10)  # give each actor time to be scheduled
      except Exception as e:
          print(f"Exception {e}")
      finally:
          ray.shutdown()

Your help is very much appreciated!

Just to confirm: this behaviour seems to be due to a bug in how num_cpus is handled for the remote Actor.

If I change ray.init(num_cpus=12) to ray.init(num_cpus=6) then the same behaviour is triggered with Actor number 7, regardless of num_cpus in @ray.remote(num_cpus=xxx).
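To make the mismatch concrete, here is a small back-of-the-envelope sketch. It assumes that CPU is the only limiting resource and that Ray simply subtracts each actor's num_cpus request from the total given to ray.init. The helper name max_actors_by_cpu is made up for illustration and is not part of Ray.

```python
from fractions import Fraction


def max_actors_by_cpu(total_cpus, cpus_per_actor):
    """Upper bound on concurrently placed actors if CPU were the only limit.

    Hypothetical helper (not a Ray API). Fractions are used so that
    values like 0.01 do not suffer from float rounding.
    """
    total = Fraction(str(total_cpus))
    per_actor = Fraction(str(cpus_per_actor))
    if per_actor <= 0:
        raise ValueError("cpus_per_actor must be positive for this estimate")
    return int(total / per_actor)  # floor for positive values


# Values from the reports above: 12 or 6 logical CPUs, 0.01 CPU per actor.
print(max_actors_by_cpu(12, 0.01))  # 1200
print(max_actors_by_cpu(6, 0.01))   # 600
```

Under these assumptions the CPU budget would allow hundreds of actors, so a stall at actor number num_cpus + 1 is not explained by the logical CPU accounting alone.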

Your help is very much appreciated.

Just to confirm, what is the lowest num_cpus with this behavior? E.g. does ray.init(num_cpus=2) trigger this problem at actor number 3? Also, do the 30s / 10s sleeps in the reproduction script matter?

Btw, I wasn’t able to reproduce this issue on macOS. We will see if we can reproduce it on Windows.
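One possible way to check these questions systematically is sketched below. It is only a sketch under assumptions: the ping method, the probe function, and the first_stuck_actor helper are all hypothetical names added for illustration and do not appear in the original script. The actor options match the reproduction script; readiness is detected by calling ping and waiting on the result with ray.wait and a timeout.

```python
def first_stuck_actor(ready_flags):
    """Return the 1-based number of the first actor that never became ready, or None."""
    for number, ready in enumerate(ready_flags, start=1):
        if not ready:
            return number
    return None


def probe(num_cpus, n_actors=None, timeout_s=30.0):
    """Start a local Ray instance and record which actors become ready in time."""
    import ray  # imported lazily so the helper above works without Ray installed

    if n_actors is None:
        n_actors = num_cpus + 3  # a few more than the suspected threshold

    @ray.remote(num_cpus=0.01)
    class Actor:
        def ping(self):  # hypothetical method, not in the original script
            return True

    ray.init(num_cpus=num_cpus)
    try:
        handles, flags = [], []
        for i in range(n_actors):
            actor = Actor.options(name=str(i + 1), lifetime="detached").remote()
            handles.append(actor)  # keep the handle alive while probing
            # An actor that is still pending cannot answer ping before the timeout.
            ready, _ = ray.wait([actor.ping.remote()], timeout=timeout_s)
            flags.append(bool(ready))
        return flags
    finally:
        ray.shutdown()


# Example use (as in the original script, call it under `if __name__ == '__main__':`):
#   flags = probe(num_cpus=2)
#   print(first_stuck_actor(flags))
```

Running probe with several num_cpus values would show whether the first stuck actor is always number num_cpus + 1, and the explicit timeout replaces the fixed sleeps.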

Hey. Thanks for your reply.

The time.sleep calls are not necessary. I just saw an issue on GitHub that recommends giving the process some time to complete.

Yes, the behaviour is the same. As I mentioned in my previous comment, it always hangs at Actor number num_cpus + 1.

This is with ray.init(num_cpus=2).

Just to follow up.

I created an Ubuntu 20.04 VM on my Windows 10 machine and ran the same script. It did not show the same behaviour, and everything worked fine.

It seems to be strictly related to Windows 10, at least according to these experiments.

Thanks for the investigation! We will try to reproduce and debug on our part.