Worker nodes stuck in "waiting-for-ssh"

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello, I am trying to launch a Ray cluster on a local (self-hosted) set of servers. I have 1 head node and 2 worker nodes configured.

cluster_name: raytest
provider:
    type: local
    head_ip: frontal.cluster.lan.example.com
    worker_ips: [node1.cluster.lan.example.com, node2.cluster.lan.example.com]
auth:
    ssh_user: dimitri.lozeve
    ssh_private_key: ~/.ssh/id_ed25519

# [...] other options from example-full.yaml

When launching with ray up -vvvvv --no-config-cache cluster-config.yaml, the head node starts properly, but the worker nodes get stuck in the waiting-for-ssh state.

======== Autoscaler status: 2022-06-30 09:33:42.267203 ========
Node status
---------------------------------------------------------------
Healthy:

Pending:
 10.168.11.11: local.cluster.node, waiting-for-ssh
 10.168.11.12: local.cluster.node, waiting-for-ssh
 127.0.1.1: local.cluster.node, waiting-for-ssh
Recent failures:
 (no failures)
==> /tmp/ray/session_latest/logs/monitor.out <==
2022-06-30 09:33:42,397	VINFO command_runner.py:552 -- Running `uptime`
2022-06-30 09:33:42,397	VVINFO command_runner.py:554 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_4e3a8ef667/2283bd7c04/%C -o ControlPersist=10s -o ConnectTimeout=5s dimitri.lozeve@10.168.11.11 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
2022-06-30 09:33:42,402	VINFO command_runner.py:552 -- Running `uptime`
2022-06-30 09:33:42,402	VVINFO command_runner.py:554 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_4e3a8ef667/2283bd7c04/%C -o ControlPersist=10s -o ConnectTimeout=5s dimitri.lozeve@10.168.11.12 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`

==> /tmp/ray/session_latest/logs/monitor.err <==
dimitri.lozeve@10.168.11.11: Permission denied (publickey).

However, if I copy-paste the SSH command above on the head node, I can successfully log in to the worker node (although there is a password prompt to unlock the private key).

although there is a password prompt to unlock the private key

This sounds like the issue. Is there a way you can create a separate key without a password for the autoscaler?
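Something along these lines should work; this is just a sketch with an example key path, and you would then point ssh_private_key in cluster-config.yaml at the new key:

# Generate a dedicated key with an empty passphrase (separate from your personal key)
ssh-keygen -t ed25519 -N "" -f ~/.ssh/ray_autoscaler_key
# Install the public key on each worker node (and the head node, if you launch remotely)
ssh-copy-id -i ~/.ssh/ray_autoscaler_key.pub dimitri.lozeve@node1.cluster.lan.example.com
ssh-copy-id -i ~/.ssh/ray_autoscaler_key.pub dimitri.lozeve@node2.cluster.lan.example.com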

Thank you! Indeed, I think that was the issue. To sidestep the problem, I now launch the cluster from the head node directly, without specifying ssh_private_key (I believe this means the autoscaler will generate a key for me).

Now the cluster seems to launch, but the workers get stuck at “setting-up”, and I get the following error:

2022-07-01 15:47:39,090 WARNING utils.py:1242 -- Unable to connect to GCS at 10.168.11.25:6380. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

The GCS is running and listening on port 6380, and the ports are open.
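As a rough sanity check (assuming nc is installed and 10.168.11.25 is the head node's address), the port can be probed like this:

# On the head node: confirm something is listening on 6380
ss -tlnp | grep 6380
# From a worker node: confirm the port is reachable over the network
nc -zv 10.168.11.25 6380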

From Python, ray.nodes() reports only the head node.

Do you know where this problem might come from?

Can you share some more details about the initialization_commands and setup_commands you have (if any)?

Also what version of Ray is running on the worker nodes and is there a firewall between the head and worker nodes?

Here are the relevant commands:

setup_commands:
    - pip3 install -U "ray[default]"

head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6380 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6380

initialization_commands, head_setup_commands, and worker_setup_commands are all empty.

I am running Ray 1.13.0 on Python 3.9.2 (Debian 11). There is no firewall between the head and worker nodes, all the ports are open. (There is already a Redis service running on port 6379, so I configured Ray to use port 6380.)
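For reference, version parity across nodes can be checked with something like this (hostnames taken from the config above):

# Print the Ray version installed on each node
for host in frontal node1 node2; do
    ssh "$host.cluster.lan.example.com" 'python3 -c "import ray; print(ray.__version__)"'
done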

I see, this should work… Do you mind SSHing into a worker node and sharing any error logs under /tmp/ray/session_latest? To access a worker node, you can run ssh -i ~/bootstrap_key.pem ubuntu@<worker-ip> from the head node.
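For example, something like this should dump any error logs (adjusting the user and key path for your local setup):

# Run from the head node; prints the tail of every .err log on one worker
ssh -i ~/ray_bootstrap_key.pem dimitri.lozeve@10.168.11.11 'tail -n 50 /tmp/ray/session_latest/logs/*.err'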

There is no /tmp/ray directory on any of the worker nodes…

In ray monitor cluster-config.yaml, I still have the “Unable to connect to GCS” error, plus this:

==> /tmp/ray/session_latest/logs/monitor.out <==
2022-07-06 08:51:48,200 WARNING utils.py:1242 -- Unable to connect to GCS at 127.0.1.1:6380. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
2022-07-06 08:51:48,335 WARNING utils.py:1242 -- Unable to connect to GCS at 127.0.1.1:6380. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
Traceback (most recent call last):                                                                                                                                                                                   
  File "/nfs/home/dimitri.lozeve/.local/bin/ray", line 8, in <module>                                                                                                                                                
    sys.exit(main())                                                                                                                                                                                                 
  File "/nfs/home/dimitri.lozeve/.local/lib/python3.9/site-packages/ray/scripts/scripts.py", line 2339, in main
    return cli()
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/nfs/home/dimitri.lozeve/.local/lib/python3.9/site-packages/ray/autoscaler/_private/cli_logger.py", line 852, in wrapper
    return f(*args, **kwargs)
  File "/nfs/home/dimitri.lozeve/.local/lib/python3.9/site-packages/ray/scripts/scripts.py", line 872, in start
    node = ray.node.Node(
  File "/nfs/home/dimitri.lozeve/.local/lib/python3.9/site-packages/ray/node.py", line 185, in __init__
    session_name = ray._private.utils.internal_kv_get_with_retry(
  File "/nfs/home/dimitri.lozeve/.local/lib/python3.9/site-packages/ray/_private/utils.py", line 1258, in internal_kv_get_with_retry
    raise RuntimeError(
RuntimeError: Could not read 'session_name' from GCS. Did GCS start successfully?
2022-07-06 08:51:50,272 WARNING utils.py:1242 -- Unable to connect to GCS at 127.0.1.1:6380. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

In case it’s useful, I can see these .err log files on the head node:

$ cat ray_client_server.err 
2022-07-06 08:49:15,226 INFO server.py:843 -- Starting Ray Client server on 0.0.0.0:10001
$ cat monitor.err
Shared connection to 10.168.11.12 closed.
Shared connection to 10.168.11.11 closed.
Shared connection to 10.168.11.25 closed.
Shared connection to 10.168.11.12 closed.
Shared connection to 10.168.11.11 closed.
Shared connection to 10.168.11.25 closed.
Shared connection to 10.168.11.12 closed.
Shared connection to 10.168.11.11 closed.
Shared connection to 10.168.11.25 closed.

Thanks for your help!

What I find surprising is that I can launch a Ray cluster manually without issue.

On the head node:

ray start --head --port=6380

On the worker(s):

ray start --address='10.168.11.10:6380'

And everything works fine. I would just like to use ray up to automate this process and use the YAML file as a kind of “infrastructure as code”. I’m not sure what the autoscaler does differently.

The autoscaler provides no benefit here besides simplifying the startup process. It looks like that’s not working in your case, unfortunately.
I’d recommend writing a simple script to automate the manual process you’ve figured out.
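A minimal sketch of such a script, based on the manual commands above (the head IP and worker hostnames are taken from your config and earlier posts, so adjust as needed):

#!/usr/bin/env bash
# Start a Ray cluster by hand: head node first, then each worker over SSH.
set -euo pipefail

HEAD_IP=10.168.11.10
WORKERS=(node1.cluster.lan.example.com node2.cluster.lan.example.com)

# Start the head node (run this script on the head node itself)
ray stop
ray start --head --port=6380

# Point each worker at the head node's GCS
for w in "${WORKERS[@]}"; do
    ssh "$w" "ray stop; ray start --address='$HEAD_IP:6380'"
done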