Ray cluster launcher cannot SSH into head node

Severity: High

Similar to this question: Not able to ssh into head node during ray up

The command:

ray up clusters/data.yaml --no-config-cache -vvvvvvvvv 2>&1 | tee cluster.log

produces the following output:

2022-10-08 22:33:23,082 INFO util.py:357 -- setting max workers for head node type to 0
2022-10-08 22:33:23,082 INFO util.py:361 -- setting max workers for ray.worker.default to 2
2022-10-08 22:33:23,194 ERROR commands.py:355 -- Failed to autodetect node resources.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 352, in _bootstrap_config
    config = provider_cls.fillout_available_node_types_resources(config)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 603, in fillout_available_node_types_resources
    instances_list = list_ec2_instances(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 84, in list_ec2_instances
    instance_types = ec2.describe_instance_types()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/botocore/client.py", line 563, in __getattr__
    raise AttributeError(
AttributeError: 'EC2' object has no attribute 'describe_instance_types'
Cluster: data

Checking AWS environment settings
Creating AWS resource `ec2` in `us-west-2`
Creating AWS resource `iam` in `us-west-2`
Creating AWS resource `ec2` in `us-west-2`
AWS config
  IAM Profile: ray-autoscaler-v1 [default]
  EC2 Key pair (all available node types): ray-autoscaler_us-west-2 [default]
  VPC Subnets (all available node types): subnet-48d1bb63, subnet-9a12e0c7, subnet-eae30092, subnet-9db6bbd6 [default]
  EC2 Security groups (all available node types): sg-0748ab06496352a82 [default]
  EC2 AMI (all available node types): ami-08e2d37b6a0129927

Creating AWS resource `ec2` in `us-west-2`
No head node found. Launching a new cluster. Confirm [y/N]: ssh: connect to host 35.92.231.31 port 22: Connection timed out

Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Acquiring an up-to-date head node
  Launched 1 nodes [subnet_id=subnet-48d1bb63]
    Launched instance i-0fe99788ff109a760 [state=pending, info=pending]
  Launched a new head node
  Fetching the new head node
  
<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Waiting for IP
      Not yet available, retrying in 5 seconds
      Received: 35.92.231.31
    Running `uptime`
      Full command is `ssh -tt -i /home/ray/.ssh/ray-autoscaler_us-west-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/8d777f385d/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@35.92.231.31 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
Warning: Permanently added '35.92.231.31' (ECDSA) to the list of known hosts.
ubuntu@35.92.231.31: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
    SSH still not available (SSH command failed.), retrying in 5 seconds.

This retry loop repeats indefinitely.
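As an aside, the "Failed to autodetect node resources" traceback near the top ('EC2' object has no attribute 'describe_instance_types') usually means the installed botocore is old enough that its EC2 service model predates describe_instance_types. If that is the case, upgrading the AWS SDK should clear that particular error, though it appears to be separate from the SSH failure:

pip install --upgrade boto3 botocore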

Manually SSH-ing in with:

ssh -i /home/ray/.ssh/ray-autoscaler_us-west-2.pem ubuntu@35.92.231.31

also fails with:

ubuntu@35.92.231.31: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

The instance is running and 2/2 status checks are passing on AWS.
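Since the AWS side looks healthy, one plausible check is which distribution the AMI actually is, because the default login user depends on it (ubuntu for Ubuntu AMIs, ec2-user for Amazon Linux). For example:

# Look up the AMI's name and description to identify the distribution
aws ec2 describe-images \
    --region us-west-2 \
    --image-ids ami-08e2d37b6a0129927 \
    --query 'Images[0].[Name,Description]' \
    --output text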

My cluster file is:

cluster_name: data
max_workers: 2

provider:
    type: aws
    region: us-west-2
docker:
  image: ray.cpu
  container_name: ray
  pull_before_run: True
  # Present in Ray template
  run_options: 
    - --ulimit nofile=65536:65536

available_node_types:
  ray.head.default:
    node_config:
      InstanceType: m5.large
      # Amazon Linux 2
      # ImageId: ami-026b57f3c383c2eec
      ImageId: ami-08e2d37b6a0129927
    resources: {}
  ray.worker.default:
    node_config:
      InstanceType: m5.large
      # ImageId: ami-026b57f3c383c2eec
      ImageId: ami-08e2d37b6a0129927
      InstanceMarketOptions:
        MarketType: spot
    resources: {}
    
head_node_type: ray.head.default

Running ray --version gives: ray, version 2.0.1.

The Docker image ray.cpu is essentially just rayproject/ray:2.0.1-py38-cpu; I'm hosting it on AWS ECR.

Switching the Docker image to rayproject/ray:2.0.1-py38-cpu does not help either
(I thought the issue might be that the head node could not pull the image from ECR).

Running:

chmod 600 /home/ray/.ssh/ray-autoscaler_us-west-2.pem

and

chmod 700 /home/ray/.ssh/

does not help either.
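Running the manual SSH with verbose output can confirm that the key is actually being offered and rejected by the server, rather than failing client-side:

ssh -vvv -i /home/ray/.ssh/ray-autoscaler_us-west-2.pem ubuntu@35.92.231.31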

It turned out the AMI I was using is Amazon Linux, which means the SSH user has to be ec2-user instead of the default ubuntu.

So adding this to the cluster config fixes it:

auth:
  ssh_user: ec2-user
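With ssh_user set correctly, ray up should get past the waiting-for-ssh step, and the manual test from above should also work with the right user:

ssh -i /home/ray/.ssh/ray-autoscaler_us-west-2.pem ec2-user@35.92.231.31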