How severe does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am trying to run the following code across a two machine cluster (head + worker).
I start the ray cluster with the following commands
head node:
ray start --head --port=6381 --node-ip-address=10.42.103.1 --dashboard-host=10.42.103.1 --dashboard-port=8080 --ray-client-server-port=6380
worker node:
ray start --address=10.42.103.1:6381 --node-ip-address=10.42.103.2 --ray-client-server-port=6380
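(As a sanity check, not part of the original commands, both nodes can be confirmed as registered with the cluster using the `ray status` CLI:)

```shell
# Run on the head node after both `ray start` commands have completed;
# both 10.42.103.1 and 10.42.103.2 should appear under "Active" nodes.
ray status
```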
The code that I am running is the following:
import socket
import time

import numpy as np

import ray


@ray.remote
def hello(x, i):
    time.sleep(30)
    print(f'{socket.gethostname()}: {i} here')
    return f'{socket.gethostname()}: {np.sum(x)}'


ray.init(address='ray://10.42.103.1:6380')

x = np.arange(100000)
xy = ray.put(x)
future_list = [hello.remote(xy, i) for i in range(100)]
for future in future_list:
    print(ray.get(future))
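For reference, each successful remote call should embed the sum of 0..99999, which can be checked locally without a cluster (a standalone sketch, not part of the reproduction script):

```python
import numpy as np

# The value np.sum(x) inside the remote function should evaluate to:
# sum of 0..99999 = 99999 * 100000 / 2
x = np.arange(100000)
print(int(np.sum(x)))  # prints 4999950000
```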
When I run it, I get the “here” print from the tasks running on the head node, but not from those on the worker node.
The futures for tasks run on the head node return the correct value from the remote function. However, the remote calls assigned to the worker node return:
Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::hello() (pid=9592, ip=10.42.103.2)
File "python\ray\_raylet.pyx", line 819, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 843, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 501, in ray._raylet.raise_if_dependency_failed
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ff.......
It looks like the remote calls scheduled on the worker node are unable to resolve the ObjectRef, whereas the ones on the head node have no problems.
When I look at the Object Store Memory on the dashboard after the ray.put call, the numpy array appears to land on the head node (memory is used there), but the worker node shows no memory used.
Is there something in the way I have set things up that is preventing the object from the ray.put call from being made available on both the head and worker node?