How to force a Ray cluster to use a custom pem when connecting to worker nodes on AWS?

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi everyone, I am launching my cluster with a YAML file that uses a custom pem file (the relevant part of the config is sketched below). The cluster launches perfectly fine, and when the head node starts a new worker node it does launch the worker with the custom pem file. However, it then tries to SSH into the worker with the default ~/ray_bootstrap_key.pem. Because of this, the head node is unable to SSH into the worker node after it is launched in order to submit jobs, and it throws the error below.
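
For context, the relevant parts of my cluster YAML look roughly like this (the key name, key path, and instance types are placeholders; the node type names match the ones in the logs):

auth:
    ssh_user: ubuntu
    ssh_private_key: /path/to/my-custom-key.pem    # custom pem, not the default ray_bootstrap_key.pem

available_node_types:
    head_node_r5:
        node_config:
            InstanceType: r5.large                 # placeholder
            KeyName: my-custom-key                 # custom EC2 key pair
    ray.worker.p2x:
        node_config:
            InstanceType: p2.xlarge                # placeholder
            KeyName: my-custom-key                 # same custom key pair for workers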

2023-06-05 16:03:06,856	VINFO command_runner.py:374 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/a1664639b9/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@10.0.1.213 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`

==> /tmp/ray/session_latest/logs/monitor.err <==
ssh: connect to host 10.0.1.100 port 22: Connection timed out

==> /tmp/ray/session_latest/logs/monitor.log <==
2023-06-05 16:03:07,216	INFO autoscaler.py:148 -- The autoscaler took 0.069 seconds to fetch the list of non-terminated nodes.
2023-06-05 16:03:07,216	INFO autoscaler.py:423 -- 
======== Autoscaler status: 2023-06-05 16:03:07.216757 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head_node_r5
Pending:
 10.0.1.213: ray.worker.p2x, waiting-for-ssh
 10.0.1.100: ray.worker.p2x, waiting-for-ssh
Recent failures:
 (no failures)

How can I force the Ray autoscaler to use the same custom pem file for SSH as well?
I specify my custom key in the ssh_private_key parameter and also set KeyName in both the head node and worker node configs.