Local cluster with multiple nodes in YAML config, but only the head node is started... Any hints?

Hi,

I’ve just started experimenting with clusters and the autoscaler and have the following cluster.yaml entries:

min_workers: 2
initial_workers: 2
max_workers: 2

provider:
    type: local
    head_ip: 172.17.1.97
    worker_ips: [172.17.1.99, 172.17.1.93]

I’m able to run ray up cluster.yaml and the cluster head node starts (the image is downloaded and the head container is initialized). I can connect to the dashboard and see one worker ready to handle tasks (172.17.1.97), but there is no sign of the 99 or 93 nodes… I would expect them to be visible (at least one of them, given the min_workers == max_workers == initial_workers entries in the config…) - am I right?
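
The expectation above (two worker IPs and min_workers == max_workers == 2) can be written down as a small consistency check on the values pasted above. This is only a sketch: `check_local_config` is a hypothetical helper, not part of Ray, and it assumes that with a local provider every worker needs its own entry in worker_ips. It does not contact any machine.

```python
def check_local_config(config):
    """Return a list of problems found in a local-provider cluster config dict.

    Hypothetical helper for illustration; it only checks that the values are
    consistent with each other, not that the machines are reachable.
    """
    problems = []
    provider = config.get("provider", {})
    head_ip = provider.get("head_ip")
    worker_ips = provider.get("worker_ips", [])
    min_workers = config.get("min_workers", 0)
    max_workers = config.get("max_workers", len(worker_ips))

    if provider.get("type") != "local":
        problems.append("provider.type is not 'local'")
    if head_ip in worker_ips:
        problems.append(f"head_ip {head_ip} is also listed in worker_ips")
    if len(set(worker_ips)) != len(worker_ips):
        problems.append("worker_ips contains duplicates")
    if min_workers > max_workers:
        problems.append("min_workers exceeds max_workers")
    # Assumption: each worker needs its own IP in worker_ips.
    if max_workers > len(worker_ips):
        problems.append("max_workers exceeds the number of worker_ips")
    return problems


# The entries from the post, written as a Python dict.
config = {
    "min_workers": 2,
    "initial_workers": 2,
    "max_workers": 2,
    "provider": {
        "type": "local",
        "head_ip": "172.17.1.97",
        "worker_ips": ["172.17.1.99", "172.17.1.93"],
    },
}

print(check_local_config(config))  # prints []
```

An empty list here only says the pasted entries agree with each other, so the missing workers would have to be explained by something outside these lines.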

Can anyone provide some hints? During the ray up command I can see log messages related only to the head node IP - none for the worker IPs. SSH login with keys is configured properly, and ray up cluster.yaml reports no errors.

Below I’m attaching the full config.

Regards,
Sebastian

cc @Ameer_Haj_Ali Can you take a look?

@Dmitri, can you please take a look?

Hi @sebzur, could you paste what the dashboard is showing?

Hi all and thank you for your response. @Dmitri - I’m attaching two screenshots:

A) full ray up cluster.yaml output

B) and the dashboard view (in a second post due to new-user limitations - I can upload only one media item per post)

As far as I understand, Ray does not have to be installed on the worker nodes - only Docker is required, and all computation is handled by the image specified in the YAML config? Anyway, while starting the cluster there is no visible attempt to connect to the worker_ips, and the same holds when running any calculations - only the head is utilized.

(Here comes the dashboard screen)

I’m also trying to get started with running Ray/RLlib on a local cluster (see other thread) and am currently stuck at the same point:
I run ray up cluster.yaml on my laptop and it completes without errors but the dashboard only shows the head node.
When training an RL agent, it’s also only performed on the head node (according to the Ray dashboard and htop running on all cluster nodes).

I did not use the docker option though. Instead, I manually installed ray 1.2.0 and my custom environment on all machines of the cluster.
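
For anyone who wants to confirm this from a script rather than from the dashboard or htop, one option is to compare the nodes Ray reports against the IPs listed in the config. A minimal sketch, assuming it is run on the head node of an already started cluster; `missing_worker_ips` and `check_cluster` are hypothetical helper names, while `ray.init(address="auto")` and `ray.nodes()` are existing Ray calls.

```python
def missing_worker_ips(expected_ips, nodes):
    """Return the expected IPs that have no alive node in `nodes`.

    `nodes` is a list of dicts shaped like the entries of ray.nodes(),
    of which only the "NodeManagerAddress" and "Alive" keys are used.
    """
    alive = {n["NodeManagerAddress"] for n in nodes if n.get("Alive")}
    return [ip for ip in expected_ips if ip not in alive]


def check_cluster(expected_ips):
    """Connect to the running cluster and report which IPs did not join."""
    import ray  # imported here so the helper above works without Ray installed

    ray.init(address="auto")  # attach to the existing cluster on this machine
    return missing_worker_ips(expected_ips, ray.nodes())


# Example with hand-written node entries (not real output), using the IPs
# from the first post: only the head node has joined.
example_nodes = [{"NodeManagerAddress": "172.17.1.97", "Alive": True}]
print(missing_worker_ips(["172.17.1.99", "172.17.1.93"], example_nodes))
# prints ['172.17.1.99', '172.17.1.93']
```

On the head node, `check_cluster(["172.17.1.99", "172.17.1.93"])` would return the worker IPs that never registered; an empty list would mean all workers joined.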

@sebzur Any news on this? Did you resolve the issue somehow?

Hi @stefanbschneider - sorry, I left Ray for a while. I’m back again and trying to re-run the cluster in Docker mode. However, there are still some issues blocking me: Publish dashboard port (aka. how to provide docker options to head node runned in container)

I wonder if you have solved your problem?

Hi @sebzur, sorry, same for me - I have been busy with lots of other things in the last months and have not yet gotten around to this. Also, the issue isn’t so relevant for me anymore, so I probably won’t have time to debug it any time soon.

Still, would be cool to hear if you have any updates/solutions.

Hi all, this should be solved by now!
Let me know if it’s not the case.