Worker node unable to retrieve object

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am trying to run the following code across a two-machine cluster (head + worker).

I start the Ray cluster with the following commands:
head node:

ray start --head --port=6381 --node-ip-address=10.42.103.1 --dashboard-host=10.42.103.1 --dashboard-port=8080 --ray-client-server-port=6380

worker node:

ray start --address=10.42.103.1:6381  --node-ip-address=10.42.103.2 --ray-client-server-port=6380
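
As a sanity check, a small script like the following can confirm that both nodes actually registered with the cluster. This is a minimal sketch that reuses the addresses from the commands above:

import ray

# Connect through the Ray Client server started on the head node.
ray.init(address='ray://10.42.103.1:6380')

# Both 10.42.103.1 and 10.42.103.2 should show up as alive
# if the worker joined the cluster correctly.
for node in ray.nodes():
    print(node['NodeManagerAddress'], node['Alive'])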

The code that I am running is the following:

import socket
import time
import numpy as np
import ray

@ray.remote
def hello(x, i):
    time.sleep(30)
    print(f'{socket.gethostname()}: {i} here')
    return f'{socket.gethostname()}: {np.sum(x)}'

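# Connect to the running cluster through the Ray Client server on the head node.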
ray.init(address='ray://10.42.103.1:6380')
x = np.arange(100000)
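# Put the array into the object store once; each task receives the ObjectRef rather than its own copy.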
xy = ray.put(x)

future_list = [hello.remote(xy, i) for i in range(100)]

for future in future_list:
    print(ray.get(future))

When I run it, I get the “here” print from the workers running on the head node, but not from those on the worker node.

The futures for the remote calls run on the head node return the correct result from the remote function. However, the remote calls assigned to the worker node return:

Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::hello() (pid=9592, ip=10.42.103.2)
File "python\ray\_raylet.pyx", line 819, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 843, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 501, in ray._raylet.raise_if_dependency_failed
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ff.......

It looks like the remote calls on the worker node are unable to find the ObjectRef, whereas the head node has no problems.

When I look at the Object Store Memory on the dashboard after the ray.put call, the numpy array appears to land on the head node, where memory is used, but the worker node shows no memory used.

Is there something in the way I have set things up that is preventing the ray.put call from propagating the object to both the head and worker nodes?

The error message is a bit misleading, but I think this is most likely happening because the worker node cannot contact the driver process on the head node. Can you make sure that all ports on the worker and head nodes are open to each other?
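
For example, a rough reachability test from the worker node is a plain TCP connect against the head node. The ports below are only the ones given explicitly in the ray start commands above; Ray also uses a number of additional, dynamically assigned ports, so this is a sketch rather than a complete check:

import socket

HEAD = "10.42.103.1"

# GCS port, Ray Client server port, and dashboard port from the commands above.
for port in (6381, 6380, 8080):
    try:
        with socket.create_connection((HEAD, port), timeout=3):
            print(f"{HEAD}:{port} reachable")
    except OSError as exc:
        print(f"{HEAD}:{port} NOT reachable: {exc}")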

Update: This is due to the lack of Windows and OSX support for multi-node Ray. The cluster appears to start, but unfortunately it fails once you try to run an application.

We’re tracking the issue here. For now, a possible workaround is to run Ray scripts from the head node and explicitly pass the public IP of the head node to ray.init:

ray.init(_node_ip_address="x.x.x.x")
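
Applied to the script above, that would look roughly like the snippet below, run on the head node itself rather than through the ray:// client address. The IP is the head node address from your ray start command; depending on the Ray version you may also need address="auto" to attach to the already-running cluster instead of starting a new local one.

import ray

# Run directly on the head node, passing its public IP explicitly.
ray.init(_node_ip_address="10.42.103.1")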