Screenshots and other visual details are in the GitHub issue: [Bug] With enough Available Resources, Most of the Actors' Creation is Pending · Issue #20891 · ray-project/ray · GitHub
The Problem:
Even with enough available resources, Actor number 13 always stays pending, regardless of the changes I make. After that, most of the remaining Actors stay pending as well.
Resources
Resources Status during Actors Creation (12 Logical CPUs / 6 Physical Cores, 32GB RAM)
Versions / Dependencies
OS: Windows 10, 64-bit
Ray: version 1.9.0
Redis: version 4.0.2
Reproduction script
import time

import ray


@ray.remote(num_cpus=0.01)  # tried different values
class Actor:
    def __init__(self):
        pass


if __name__ == '__main__':
    try:
        ray.init(num_cpus=12)  # tried different values
        time.sleep(30)
        for i in range(50):
            # Named, detached actors: "1", "2", ..., "50"
            Actor.options(name=str(i + 1), lifetime="detached").remote()
            time.sleep(10)
    except Exception as e:
        print("Exception {} ".format(str(e)))
    finally:
        ray.shutdown()
Your help is very much appreciated!
Just to confirm that this behaviour seems to be due to a bug in the num_cpus of the remote Actor.
If I change ray.init(num_cpus=12) to ray.init(num_cpus=6), then the same behaviour is triggered with Actor number 7, regardless of num_cpus in @ray.remote(num_cpus=xxx).
Your help is very much appreciated.
Just to confirm, what is the lowest num_cpus with this behavior? E.g. does ray.init(num_cpus=2) trigger this problem at actor number 3? Also, do the 30s / 10s sleeps in the reproduction script matter?
Btw I wasn't able to reproduce this issue on macOS. Will see if we can reproduce on Windows.
Hey. Thanks for your reply.
The time.sleep calls are not necessary. I just saw an issue on GitHub that recommended giving the process some time to complete.
Yes, the behaviour is the same. As I mentioned in my previous comment, it always hangs at Actor number num_cpus + 1.
This is with ray.init(num_cpus=2).
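To check which actors are actually pending without relying on the dashboard, one option is to give each actor a trivial method and wait on it with a timeout. The following is only a minimal sketch, not part of the original report: PingActor, probe_actors, and split_ready_pending are hypothetical names, the actors are not detached (so they are cleaned up on shutdown), and the 30-second timeout is an arbitrary assumption. It uses ray.wait and ray.available_resources.

```python
def split_ready_pending(refs_by_name, ready_refs):
    """Partition actor names by whether their ping reference completed."""
    ready_set = set(ready_refs)
    ready = [name for name, ref in refs_by_name.items() if ref in ready_set]
    pending = [name for name, ref in refs_by_name.items() if ref not in ready_set]
    return ready, pending


def probe_actors(num_cpus=2, num_actors=5, timeout_s=30.0):
    """Create actors, ping each one, and report which never answered."""
    import ray  # imported lazily so the helper above works without Ray

    @ray.remote(num_cpus=0.01)
    class PingActor:  # hypothetical stand-in for the Actor in the report
        def ping(self):
            return True

    ray.init(num_cpus=num_cpus)
    try:
        # Keep handles alive so the non-detached actors are not collected.
        actors = {
            str(i + 1): PingActor.options(name=str(i + 1)).remote()
            for i in range(num_actors)
        }
        refs = {name: actor.ping.remote() for name, actor in actors.items()}
        # A ping only completes once its actor has actually been created.
        ready, _ = ray.wait(
            list(refs.values()), num_returns=len(refs), timeout=timeout_s
        )
        print("available resources:", ray.available_resources())
        return split_ready_pending(refs, ready)
    finally:
        ray.shutdown()


if __name__ == "__main__":
    ready, pending = probe_actors(num_cpus=2, num_actors=5)
    print("ready:", ready)
    print("pending:", pending)
```

If the pattern described above holds, the first name in the pending list would be num_cpus + 1; on a platform where the problem does not occur, the pending list should be empty.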
Just to follow up.
I created an Ubuntu 20.04 VM on my Windows 10 machine, then ran the same script. It did not show the same behaviour; everything worked fine.
It seems to be strictly related to Windows 10, at least that is what experiments show.
Thanks for the investigation! We will try to reproduce and debug on our part.