Local cluster with multiple nodes in YAML config, while there's only head being started... Any hints?

sebzur · January 21, 2021, 10:36pm

Hi,

I’ve just started experimenting with cluster and autoscalers and have the following cluster.yaml entries:

min_workers: 2
initial_workers: 2
max_workers: 2

provider:
    type: local
    head_ip: 172.17.1.97
    worker_ips: [172.17.1.99, 172.17.1.93]

I’m able to run ray up cluster.yaml and the cluster head node starts (the image is downloaded and the head container is initialized). I can connect to the dashboard and see one worker ready to handle tasks (172.17.1.97), but there is no sign of the 99 or 93 nodes… I would expect them to be visible (at least one of them, given the min_workers == max_workers == initial_workers entries in the config…) - am I right?

Can anyone provide some hints? During the ray up command I see log messages related only to the head node IP, and none for the worker IPs. SSH login with keys is configured properly, and ray up cluster.yaml reports no errors.
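A quick way to rule out basic connectivity problems is to check, from the head node, that the SSH port on each worker IP accepts connections. The following is only a minimal sketch: the helper name ssh_port_open is hypothetical, port 22 is an assumption (the thread does not say which SSH port is used), and a successful TCP connection does not prove that key-based login works.

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only tests TCP reachability, not SSH authentication.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example usage with the worker_ips from the config above:
#   for ip in ["172.17.1.99", "172.17.1.93"]:
#       print(ip, "reachable" if ssh_port_open(ip) else "NOT reachable")
```

If a worker IP is reported as not reachable here, the problem is probably at the network or SSH level rather than in the cluster config.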

Below I’m attaching the full config.

Regards,
Sebastian

sangcho · January 22, 2021, 2:57am

cc @Ameer_Haj_Ali Can you take a look?

Ameer_Haj_Ali · January 22, 2021, 3:54pm

@Dmitri, can you please take a look?

Dmitri · January 22, 2021, 5:02pm

Hi @sebzur , could you paste what the dashboard is showing?

sebzur · January 24, 2021, 8:01am

Hi all and thank you for your response. @Dmitri - I’m attaching two screenshots:

A) full ray up cluster.yaml output

B) and the dashboard view (in a second post due to new-user limitations - I can upload only one media item per post)

As far as I understand, Ray does not have to be installed on the worker nodes - only Docker is required, and all the computation is handled by the image specified in the YAML config? Anyway, no attempt to connect to worker_ips is visible, either while starting the cluster or while running any calculations - only the head is utilized.
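To cross-check what the dashboard shows, the node table can also be queried from Python on the head node. This is a rough sketch rather than a definitive diagnostic: missing_nodes and check_cluster are hypothetical helper names, and it assumes the entries returned by ray.nodes() carry the "NodeManagerAddress" and "Alive" keys.

```python
def missing_nodes(expected_ips, node_table):
    """Return the expected IPs that have no alive entry in a ray.nodes()-style table."""
    alive = {n.get("NodeManagerAddress") for n in node_table if n.get("Alive")}
    return sorted(set(expected_ips) - alive)


def check_cluster(expected_ips):
    """Connect to the already running cluster and list expected nodes that did not join."""
    import ray  # imported lazily so missing_nodes() can be used without Ray installed

    ray.init(address="auto")  # attach to the existing cluster instead of starting a new one
    return missing_nodes(expected_ips, ray.nodes())


# Example usage on the head node, with the IPs from cluster.yaml:
#   print(check_cluster(["172.17.1.97", "172.17.1.99", "172.17.1.93"]))
```

If both worker IPs come back as missing, the workers never joined the cluster, which would match the dashboard showing only the head.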

sebzur · January 24, 2021, 8:02am

(Here comes the dashboard screen)

stefanbschneider · February 16, 2021, 8:45pm

I’m also trying to get started with running Ray/RLlib on a local cluster (see other thread) and am currently stuck at the same point:
I run ray up cluster.yaml on my laptop and it completes without errors but the dashboard only shows the head node.
When training an RL agent, it’s also only performed on the head node (according to the Ray dashboard and htop running on all cluster nodes).

I did not use the docker option though. Instead, I manually installed ray 1.2.0 and my custom environment on all machines of the cluster.

@sebzur Any news on this? Did you resolve the issue somehow?

sebzur · May 24, 2022, 10:35pm

Hi @stefanbschneider - sorry, I left Ray for a while. I’m back again and trying to re-run the cluster in Docker mode. However, there is still an issue that blocks me: Publish dashboard port (aka. how to provide docker options to head node runned in container)

I wonder if you have solved your problem?

stefanbschneider · May 25, 2022, 5:29am

Hi @sebzur , sorry, same for me - I have been busy with lots of other things in the last few months and have not yet gotten around to this. Also, the issue isn’t so relevant for me anymore, so I probably won’t have time to debug it any time soon.

Still, would be cool to hear if you have any updates/solutions.

Dmitri · June 10, 2022, 12:38am

Hi all, this should be solved by now!
Let me know if it’s not the case.
