Ray status does not see worker node

I am currently testing a hybrid infrastructure where the head node of the ray cluster is running on an AWS EC2 instance and the worker node is my local computer.

When I connect the worker node using the public IP address of the EC2 head node, no error message appears, but when I run the “ray status” command, I can only see the head node. The connection to the cluster seems to be working, because “ray status” on my local computer returns the correct resources of the head node, but it shows nothing about my local worker node.

Also, I can successfully connect to the cluster from a Python application using the “ray.init(address=…)” command, and I can see both the head node AND the worker node when I run the “ray.nodes()” Python command. But again, running the distributed app only results in the head node doing work, not the worker node.

Could it be an issue with the autoscaler not seeing the worker?

cc @Dmitri @Alex for a possible autoscaler issue.


Hi @Thomas_Rochefort-Bea, could you provide more details on your configuration, such as Ray versions, operating systems used, and any networking details you think might be relevant?

Also, could you check /tmp/ray/session_latest/logs on the head node for anything unusual? In particular, the monitor.* files are the autoscaler monitor logs.

Another thing to do is to run ray.cluster_resources() to see if the output indicates the worker is connected.
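A minimal sketch of that check, which also prints the address each node registered with. The helper names `summarize_nodes` and `main` are made up for illustration, and the dictionary keys used (`NodeManagerAddress`, `Alive`, `Resources`) are what `ray.nodes()` returns in the Ray versions I am familiar with, so treat them as an assumption for your version.

```python
def summarize_nodes(nodes):
    """Condense the list of dicts returned by ray.nodes() into small rows.

    `nodes` is assumed to look like the output of ray.nodes(): one dict per
    node with "NodeManagerAddress", "Alive" and "Resources" entries.
    """
    rows = []
    for node in nodes:
        rows.append(
            {
                "address": node.get("NodeManagerAddress"),
                "alive": bool(node.get("Alive")),
                "cpus": node.get("Resources", {}).get("CPU", 0),
            }
        )
    return rows


def main(address="auto"):
    # Imported lazily so summarize_nodes() can be used without Ray installed.
    import ray

    ray.init(address=address)
    for row in summarize_nodes(ray.nodes()):
        print(row)
    # Totals across every node the cluster currently counts as connected.
    print(ray.cluster_resources())
```

Calling `main()` on a machine that has already joined the cluster prints one row per node. If the worker shows up with a private address (for example 192.168.x.x), that may mean the head cannot connect back to it, which could explain why no tasks are scheduled there.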

Finally, could you describe your use case? Would using Ray client on a laptop work in place of using the laptop as a worker?

Hi Dmitri,

The use case is as follows:

I want to connect multiple worker nodes (all on different public IPs) to one central head node on an EC2 instance. Then I would like to allow any other computer (different from worker and head node) to connect as a driver and send jobs to the cluster.

I understand now that I have to do a complex setup with port forwarding to get through the various firewalls.

Any information/tips on this setup?

Hello @Dmitri !

Any idea how one could make such a decentralized setup work?

Thanks,
Thomas

One option is to manually set up the cluster by running ray start --head on the head and ray start --address=<head_address>:6379 on all the workers, assuming the networking is set up so that the workers can reach the head at the given address.
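To test that networking assumption from a worker machine, a plain TCP connect against the head's port is a quick first check. This is only a sketch using the standard library: `port_is_open` and `check_head` are hypothetical helpers, not part of Ray, and the port table just lists the default ports discussed in this thread.

```python
import socket

# Default ports discussed in this thread; adjust if you changed them.
RAY_PORTS = {
    "ray start --address": 6379,
    "Ray client": 10001,
    "dashboard / job submission": 8265,
}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def check_head(host, ports=RAY_PORTS):
    """Map each named port to whether it is reachable on the given host."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

For example, `check_head("<head_address>")` run on the laptop shows which of the head's ports are reachable from there. A successful connect only proves the worker-to-head direction. As far as I know, Ray nodes also need to be reachable from each other, so the reverse direction is worth checking too.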

To execute work remotely on the head, some choices would be Ray client and Ray job submission. Ray client uses a gRPC server running on the head at port 10001 by default.
Ray job submission uses the Ray dashboard HTTP server, which runs on the head at port 8265 by default.
Again, your network setup would need to expose the relevant ports on the Ray head for remote access.
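A rough sketch of both options from a separate driver machine, assuming the default ports above are exposed. The function names are made up for illustration. The `ray://` address scheme and `ray.job_submission.JobSubmissionClient` are what recent Ray releases provide as far as I know; older versions may expose the job client under a different import path.

```python
def ray_client_address(head_host, port=10001):
    """Build the address string that Ray client expects."""
    return f"ray://{head_host}:{port}"


def job_submission_address(head_host, port=8265):
    """Build the dashboard URL that job submission talks to."""
    return f"http://{head_host}:{port}"


def run_via_client(head_host):
    # Imported lazily so the address helpers work without Ray installed.
    import ray

    ray.init(ray_client_address(head_host))

    @ray.remote
    def hostname():
        import socket

        return socket.gethostname()

    # Returns the hostname of whichever cluster node ran the task.
    return ray.get(hostname.remote())


def submit_job(head_host, entrypoint):
    # Import path assumed for recent Ray versions.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(job_submission_address(head_host))
    # submit_job returns an ID that can be used to query the job later.
    return client.submit_job(entrypoint=entrypoint)
```

With Ray client the driver script runs locally and forwards calls to the head, while job submission ships an entrypoint command (for example `"python my_script.py"`) to run on the cluster, so the choice depends on where you want the driver process to live.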

Manual Ray cluster setup:
https://docs.ray.io/en/latest/cluster/cloud.html#manual-cluster
Ray client:
https://docs.ray.io/en/latest/cluster/ray-client.html
Job submission:
https://docs.ray.io/en/latest/cluster/job-submission.html