How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello, I am trying to launch a Ray cluster on a local (self-hosted) set of servers. I have 1 head node and 2 worker nodes configured.
cluster_name: raytest

provider:
    type: local
    head_ip: frontal.cluster.lan.example.com
    worker_ips: [node1.cluster.lan.example.com, node2.cluster.lan.example.com]

auth:
    ssh_user: dimitri.lozeve
    ssh_private_key: ~/.ssh/id_ed25519

# [...] other options from example-full.yaml
When launching with ray up -vvvvv --no-config-cache cluster-config.yaml, the head node starts properly, but the worker nodes get stuck in the waiting-for-ssh state.
======== Autoscaler status: 2022-06-30 09:33:42.267203 ========
Node status
---------------------------------------------------------------
Healthy:
Pending:
10.168.11.11: local.cluster.node, waiting-for-ssh
10.168.11.12: local.cluster.node, waiting-for-ssh
127.0.1.1: local.cluster.node, waiting-for-ssh
Recent failures:
(no failures)
==> /tmp/ray/session_latest/logs/monitor.out <==
2022-06-30 09:33:42,397 VINFO command_runner.py:552 -- Running `uptime`
2022-06-30 09:33:42,397 VVINFO command_runner.py:554 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_4e3a8ef667/2283bd7c04/%C -o ControlPersist=10s -o ConnectTimeout=5s dimitri.lozeve@10.168.11.11 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
2022-06-30 09:33:42,402 VINFO command_runner.py:552 -- Running `uptime`
2022-06-30 09:33:42,402 VVINFO command_runner.py:554 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_4e3a8ef667/2283bd7c04/%C -o ControlPersist=10s -o ConnectTimeout=5s dimitri.lozeve@10.168.11.12 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
==> /tmp/ray/session_latest/logs/monitor.err <==
dimitri.lozeve@10.168.11.11: Permission denied (publickey).
However, if I copy-paste the SSH command above on the head node, I can successfully log in to the worker node (although there is a password prompt to unlock the private key).
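The password prompt suggests the private key is passphrase-protected, which a non-interactive process like the autoscaler cannot answer. A quick way to check is ssh-keygen -y -P "", which extracts the public key using an empty passphrase and fails exactly when the key is encrypted (the sketch below generates a throwaway encrypted key to demonstrate, rather than touching a real one):

```shell
# Sketch: ssh-keygen -y -P "" reads the private key with an empty
# passphrase; it fails if (and only if) the key is encrypted.
# Demo with a throwaway passphrase-protected key:
tmpdir=$(mktemp -d)
ssh-keygen -q -t ed25519 -N "secret" -f "$tmpdir/demo_key"

if ssh-keygen -y -P "" -f "$tmpdir/demo_key" >/dev/null 2>&1; then
    echo "key is not passphrase-protected"
else
    echo "key has a passphrase (a non-interactive process cannot unlock it)"
fi
```

Running the same check against ~/.ssh/id_ed25519 (or the ~/ray_bootstrap_key.pem copy on the head node) would show whether the autoscaler's non-interactive ssh invocation can use the key at all.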