Head node fails to ssh into worker nodes

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello,
I’m having trouble getting a Ray cluster to work on Azure. The cluster itself is created successfully; however, when the head node tries to launch a worker node, the worker gets stuck in waiting-for-ssh and the head node gets a "Permission denied (publickey)" error:

======== Autoscaler status: 2022-08-16 02:20:52.990837 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.default
Pending:
 10.103.0.5: ray.worker.default, waiting-for-ssh
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 2.0/2.0 CPU
 0.00/4.179 GiB memory
 0.00/2.089 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 7521+ pending tasks/actors

==> /tmp/ray/session_latest/logs/monitor.err <==
Warning: Permanently added '10.103.0.5' (ECDSA) to the list of known hosts.
ubuntu@10.103.0.5: Permission denied (publickey).

==> /tmp/ray/session_latest/logs/monitor.out <==
2022-08-16 02:20:54,783 VINFO command_runner.py:552 -- Running `uptime`
2022-08-16 02:20:54,783 VVINFO command_runner.py:555 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@10.103.0.5 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
2022-08-16 02:20:54,817 INFO updater.py:316 -- SSH still not available (SSH command failed.), retrying in 5 seconds.

==> /tmp/ray/session_latest/logs/monitor.log <==
2022-08-16 02:20:58,358 INFO autoscaler.py:330 --
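
For context, the autoscaler status and log excerpts above were collected on the head node with something along these lines (the exact commands are a reconstruction):

# autoscaler summary (same as the "Autoscaler status" block above)
ray status
# follow the autoscaler logs; tail prints the "==> ... <==" headers shown above
tail -f /tmp/ray/session_latest/logs/monitor.err /tmp/ray/session_latest/logs/monitor.out /tmp/ray/session_latest/logs/monitor.log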

Config file:

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 1

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty object means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: azure
    # https://azure.microsoft.com/en-us/global-infrastructure/locations
    location: eastus2
    resource_group: RayCluster03
    # set subscription id otherwise the default from az cli will be used
    subscription_id: 00000000-0000-0000-0000-000000000000

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # you must specify paths to matching private and public key pair files
    # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
    ssh_private_key: ~/.ssh/id_rsa
    # changes to this should match what is specified in file_mounts
    ssh_public_key: ~/.ssh/id_rsa.pub

# More specific customization to node configurations can be made using the ARM template azure-vm-template.json file
# See documentation here: https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/2019-03-01/virtualmachines
# Changes to the local file will be used during deployment of the head node, however worker nodes deployment occurs
# on the head node, so changes to the template must be included in the wheel file used in setup_commands section below

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The resources provided by this node type.
        resources: {"CPU": 2}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2s_v3
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                priority: Low

    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 4
        # The resources provided by this node type.
        resources: {"CPU": 2}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2s_v3
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Low
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
     "~/.ssh/id_rsa.pub": "~/.ssh/id_rsa.pub"
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    # enable docker setup
    - sudo usermod -aG docker $USER || true
    - sleep 10  # delay to avoid docker permission denied errors
    # get rid of annoying Ubuntu message
    - touch ~/.sudo_as_admin_successful

# List of shell commands to run to set up nodes.
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands:
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
# NOTE: rayproject/ray-ml:latest has azure packages bundled
head_setup_commands:
    # - pip install -U azure-cli-core==2.22.0 azure-mgmt-compute==14.0.0 azure-mgmt-msi==1.0.0 azure-mgmt-network==10.2.0 azure-mgmt-resource==13.0.0
    - pip install -U azure-cli-core==2.22.0 azure-mgmt-compute==17.0.0b1 azure-mgmt-msi==1.0.0 azure-identity==1.6.1 azure-mgmt-network==19.0.0
    

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}

Ray version: 1.13.0

Manually ssh’ing into the worker works fine. How can I fix this?

Thanks

Are you providing anything in the auth section? For reference, here's a minimal Azure example: ray/example-minimal.yaml at master · ray-project/ray · GitHub

Sorry, disregard that; I missed it in your config.

cc @gramhagen Any guesses here? Possibly related to this issue, but it seems the head node was able to set up fine.

From the error message it sounds like the SSH keys are not being set up correctly. Can you share the command you are using to manually SSH into the machine? Are you able to SSH to that worker from the head node, or via the Ray CLI?
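
One thing worth trying is to reproduce the autoscaler's own check by hand from the head node. This is adapted from the full command in your monitor.out log above, so treat the exact flags as a sketch:

# Same key and the important options the autoscaler uses. IdentitiesOnly=yes
# means only ray_bootstrap_key.pem is offered, not agent or default keys.
ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes ubuntu@10.103.0.5 uptime

If that fails with the same "Permission denied (publickey)", the way it differs from whatever manual command does work should point at the problem.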

I successfully ssh’ed into the worker node from the head node with this command:

ssh -i ray_bootstrap_key.pem ubuntu@10.103.0.5

And this was the output:

(base) ray@ray-default-head-50d6459f0:~$ ssh -i ray_bootstrap_key.pem ubuntu@10.103.0.5
The authenticity of host '10.103.0.5 (10.103.0.5)' can't be established.
ECDSA key fingerprint is SHA256:apKXzq8gqxCjEifDwhBIybQXNgJ5AB5uuXdzfA9OnW0.
Are you sure you want to continue connecting (yes/no)? yes
Failed to add the host to the list of known hosts (/home/ray/.ssh/known_hosts).
Enter passphrase for key 'ray_bootstrap_key.pem': 
Welcome to Ubuntu 18.04.6 LTS (GNU/Linux 5.4.0-1085-azure x86_64)

  System information as of Wed Aug 17 15:38:52 UTC 2022

  System load:  0.58                Processes:              157
  Usage of /:   47.2% of 145.20GB   Users logged in:        1
  Memory usage: 14%                 IP address for eth0:    10.103.0.5
  Swap usage:   0%                  IP address for docker0: 172.17.0.1

 * Super-optimized for small spaces - read how we shrank the memory
   footprint of MicroK8s to make it the smallest full K8s around.

   https://ubuntu.com/blog/microk8s-memory-optimisation

1 update can be applied immediately.
To see these additional updates run: apt list --upgradable

New release '20.04.4 LTS' available.
Run 'do-release-upgrade' to upgrade to it.


***************************************************************************
* Welcome to the Ubuntu 18.04 Data Science Virtual Machine!               *
*                                                                         *
* You can access this DSVM, view the graphical desktop with               *
* X2Go, or run JupyterLab from a browser on your computer                 *
* For more information, see the docs at https://aka.ms/dsvm/docs.         *
***************************************************************************

Last login: Wed Aug 17 15:37:01 2022 from 149.143.127.100
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

This part seems abnormal to me: did you have to enter the passphrase? I don’t think Ray will be able to do that automatically.

Yes, I had to enter the passphrase.
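
If so, that would explain it: the autoscaler runs SSH non-interactively, so it can never type the passphrase for you. A sketch of one way around it (file names below are just placeholders; adjust to your setup) is to switch the auth section to a key pair with no passphrase:

# create a fresh key pair with an empty passphrase (-N ""); the file name is just an example
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/ray_azure_key
# or strip the passphrase from the existing key in place
ssh-keygen -p -N "" -f ~/.ssh/id_rsa

Then point auth.ssh_private_key / auth.ssh_public_key (and the ~/.ssh/id_rsa.pub entry in file_mounts) at whichever key you end up using.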