Ray status does not see worker node

Thomas_Rochefort-Bea · January 10, 2022, 5:05pm

I am currently testing a hybrid infrastructure where the head node of the ray cluster is running on an AWS EC2 instance and the worker node is my local computer.

When I connect the worker node using the public IP address of the EC2 head node, no error message appears but when I use the “ray status” command, I can only see the head node. The connection to the cluster seems to be working because “ray status” on my local computer returns the correct resources of the head node, but nothing about my local worker node.

Also, I can successfully connect to the cluster with a python application using the “ray.init(address=…)” command and I can see both the head node AND worker node when I run the “ray.nodes()” python command. But again, running the distributed app only results in the head node doing any work, not the worker node.

Could it be an issue with the autoscaler not seeing the worker?

Clark_Zinzow · January 10, 2022, 8:31pm

cc @Dmitri @Alex for a possible autoscaler issue.

Dmitri · January 12, 2022, 9:12pm

Hi @Thomas_Rochefort-Bea, could you provide more details on your configuration, such as Ray versions, operating systems used, and any networking details you think might be relevant?

Also, could you check /tmp/ray/session_latest/logs for anything weird? In particular, the monitor.* files are the monitor logs.

Another thing to do is to run ray.cluster_resources() to see if the output indicates the worker is connected.
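A rough sketch of that check, assuming it runs on a machine that is already part of the cluster. The helper name summarize_nodes is made up for illustration, and the dictionary keys (Alive, NodeManagerAddress, Resources) are the ones ray.nodes() returns in the Ray versions I am aware of; they may differ in yours.

```python
from typing import Any, Dict, List


def summarize_nodes(nodes: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Map each live node's reported address to the resources it advertises.

    `nodes` is expected to look like the list returned by ray.nodes().
    """
    summary: Dict[str, Dict[str, float]] = {}
    for node in nodes:
        if not node.get("Alive", False):
            continue  # skip nodes the cluster considers dead
        address = node.get("NodeManagerAddress", "<unknown>")
        summary[address] = dict(node.get("Resources", {}))
    return summary


def print_cluster_view() -> None:
    """Print cluster-wide totals next to the per-node breakdown."""
    import ray  # imported lazily so summarize_nodes works without Ray installed

    ray.init(address="auto")  # assumption: run on a node that has joined the cluster
    print("Total resources:    ", ray.cluster_resources())
    print("Available resources:", ray.available_resources())
    for address, resources in summarize_nodes(ray.nodes()).items():
        print(address, resources)
```

Calling print_cluster_view() should show whether the worker's CPUs are counted in the cluster totals. If the worker is listed with an address the head cannot route to (for example a private LAN address), that may be worth mentioning in your reply.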

Finally, could you describe your use case? Would using Ray client on a laptop work in place of using the laptop as a worker?

Thomas_Rochefort-Bea · February 7, 2022, 4:13pm

Hi Dmitri,

The use case is as follows:

I want to connect multiple worker nodes (all on different public IPs) to one central head node on an EC2 instance. Then I would like to allow any other computer (different from worker and head node) to connect as a driver and send jobs to the cluster.

I understand now that I have to do a complex setup with port forwarding to get through the various firewalls.

Any information/tips on this setup?

Thomas_Rochefort-Bea · April 7, 2022, 1:14am

Hello @Dmitri !

Any idea how one could make such a decentralized setup work?

Thanks,
Thomas

Dmitri · April 28, 2022, 3:09am

One option is to manually set up the cluster by running ray start --head on the head and ray start --address=<head_address>:6379 on all the workers, assuming the networking is set up so that the workers can reach the head at the given address.

To execute work remotely on the head, some choices would be Ray client and Ray job submission. Ray client uses a gRPC server running on the head at port 10001 by default.
Ray job submission uses the Ray dashboard HTTP server, which runs on the head at port 8265 by default.
Again, your networking setup would need to expose the relevant ports on the Ray head for remote access.
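Before debugging Ray itself, it can help to confirm that those ports are reachable from the remote machine. The following is a minimal sketch using only the Python standard library; the function names and the port labels are made up for illustration, and the port numbers are just the defaults mentioned above.

```python
import socket
from typing import Dict

# Default ports from the discussion above; change them if the head
# was started with different values.
DEFAULT_HEAD_PORTS: Dict[str, int] = {
    "head address used by ray start": 6379,
    "Ray client (gRPC)": 10001,
    "dashboard / job submission (HTTP)": 8265,
}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_head_ports(host: str, ports: Dict[str, int] = DEFAULT_HEAD_PORTS) -> Dict[str, bool]:
    """Check every named port on `host` and report which ones accept connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

For example, check_head_ports("<head_address>") run from a worker or driver machine returns a dict of label to True/False. A successful TCP connection only shows that the port is exposed; it does not show that Ray traffic in the other direction (head to worker) can get through.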

Manual Ray cluster setup:
https://docs.ray.io/en/latest/cluster/cloud.html#manual-cluster
Ray client:
https://docs.ray.io/en/latest/cluster/ray-client.html
Job submission:
https://docs.ray.io/en/latest/cluster/job-submission.html
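
As a rough illustration of the two options, here is a sketch of connecting from a separate driver machine. The helper names are hypothetical, the ports are the defaults mentioned above, and the ray.job_submission import path is the one used in recent Ray releases; older versions may expose the client elsewhere, so check the docs for your version.

```python
def ray_client_address(head_host: str, port: int = 10001) -> str:
    """Build the address string that Ray client expects."""
    return f"ray://{head_host}:{port}"


def job_server_address(head_host: str, port: int = 8265) -> str:
    """Build the dashboard URL that job submission talks to."""
    return f"http://{head_host}:{port}"


def run_via_ray_client(head_host: str) -> str:
    """Connect with Ray client and report which host executed a trivial task."""
    import ray  # imported lazily so the address helpers work without Ray installed

    ray.init(address=ray_client_address(head_host))

    @ray.remote
    def where_am_i() -> str:
        import socket
        return socket.gethostname()

    return ray.get(where_am_i.remote())


def submit_script(head_host: str, entrypoint: str) -> str:
    """Submit a shell entrypoint (e.g. "python my_script.py") as a Ray job."""
    # Assumption: this import path matches the installed Ray version.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(job_server_address(head_host))
    return client.submit_job(entrypoint=entrypoint)  # returns the job ID
```

With Ray client the driver code runs interactively against the cluster, while job submission ships an entrypoint command to the head and returns immediately with a job ID.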

Chengcheng_Pei · July 15, 2024, 5:20pm

Did you make the decentralized setup work?
How did you expose your local computer to the public internet as a Ray worker so that the Ray cluster can connect to it?
