Severity: High
Similar to this question: *Not able to ssh into head node during ray up*.
The command:

```shell
ray up clusters/data.yaml --no-config-cache -vvvvvvvvv 2>&1 | tee cluster.log
```
produces the following output:

```
2022-10-08 22:33:23,082 INFO util.py:357 -- setting max workers for head node type to 0
2022-10-08 22:33:23,082 INFO util.py:361 -- setting max workers for ray.worker.default to 2
2022-10-08 22:33:23,194 ERROR commands.py:355 -- Failed to autodetect node resources.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 352, in _bootstrap_config
    config = provider_cls.fillout_available_node_types_resources(config)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 603, in fillout_available_node_types_resources
    instances_list = list_ec2_instances(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 84, in list_ec2_instances
    instance_types = ec2.describe_instance_types()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/botocore/client.py", line 563, in __getattr__
    raise AttributeError(
AttributeError: 'EC2' object has no attribute 'describe_instance_types'
Cluster: data
Checking AWS environment settings
Creating AWS resource `ec2` in `us-west-2`
Creating AWS resource `iam` in `us-west-2`
Creating AWS resource `ec2` in `us-west-2`
AWS config
IAM Profile: ray-autoscaler-v1 [default]
EC2 Key pair (all available node types): ray-autoscaler_us-west-2 [default]
VPC Subnets (all available node types): subnet-48d1bb63, subnet-9a12e0c7, subnet-eae30092, subnet-9db6bbd6 [default]
EC2 Security groups (all available node types): sg-0748ab06496352a82 [default]
EC2 AMI (all available node types): ami-08e2d37b6a0129927
Creating AWS resource `ec2` in `us-west-2`
No head node found. Launching a new cluster. Confirm [y/N]: ssh: connect to host 35.92.231.31 port 22: Connection timed out
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Acquiring an up-to-date head node
Launched 1 nodes [subnet_id=subnet-48d1bb63]
Launched instance i-0fe99788ff109a760 [state=pending, info=pending]
Launched a new head node
Fetching the new head node
<1/1> Setting up head node
Prepared bootstrap config
New status: waiting-for-ssh
[1/7] Waiting for SSH to become available
Running `uptime` as a test.
Waiting for IP
Not yet available, retrying in 5 seconds
Received: 35.92.231.31
Running `uptime`
Full command is `ssh -tt -i /home/ray/.ssh/ray-autoscaler_us-west-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/8d777f385d/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@35.92.231.31 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
Warning: Permanently added '35.92.231.31' (ECDSA) to the list of known hosts.
ubuntu@35.92.231.31: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
SSH still not available (SSH command failed.), retrying in 5 seconds.
```

These last five lines (the `uptime` attempt, the full SSH command, the known-hosts warning, `Permission denied`, and the retry notice) then repeat indefinitely.
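As an aside, the `AttributeError: 'EC2' object has no attribute 'describe_instance_types'` near the top of the log makes me suspect an outdated botocore in the environment running `ray up` — the EC2 client methods are generated from botocore's service model, so an older release would simply lack `describe_instance_types` (my assumption, not confirmed). The installed versions can be listed with:

```shell
# Show the boto3/botocore versions visible to the Python that runs `ray up`
pip show boto3 botocore | grep -E '^(Name|Version)'
```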
Manually SSHing in with:

```shell
ssh -i /home/ray/.ssh/ray-autoscaler_us-west-2.pem ubuntu@35.92.231.31
```

also fails with:

```
ubuntu@35.92.231.31: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
```
The instance is running and 2/2 status checks are passing on AWS.
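In case it helps: the key on disk can be compared against the fingerprint AWS stores for the `ray-autoscaler_us-west-2` key pair. For key pairs created by AWS, the fingerprint is (to my understanding, worth double-checking) the SHA-1 of the private key in PKCS#8 DER form:

```shell
# Fingerprint of the local private key (path taken from the log above)
openssl pkcs8 -in /home/ray/.ssh/ray-autoscaler_us-west-2.pem \
    -nocrypt -topk8 -outform DER | openssl sha1 -c

# Fingerprint AWS has on record for the key pair
aws ec2 describe-key-pairs --key-names ray-autoscaler_us-west-2 \
    --region us-west-2 --query 'KeyPairs[0].KeyFingerprint' --output text
```

If the two differ, the head node was launched with a key pair that doesn't match the local `.pem`, which would explain the `Permission denied (publickey)`.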
My cluster file is:

```yaml
cluster_name: data
max_workers: 2

provider:
  type: aws
  region: us-west-2

docker:
  image: ray.cpu
  container_name: ray
  pull_before_run: True
  # Present in Ray template
  run_options:
    - --ulimit nofile=65536:65536

available_node_types:
  ray.head.default:
    node_config:
      InstanceType: m5.large
      # Amazon Linux 2
      # ImageId: ami-026b57f3c383c2eec
      ImageId: ami-08e2d37b6a0129927
    resources: {}
  ray.worker.default:
    node_config:
      InstanceType: m5.large
      # ImageId: ami-026b57f3c383c2eec
      ImageId: ami-08e2d37b6a0129927
      InstanceMarketOptions:
        MarketType: spot
    resources: {}

head_node_type: ray.head.default
```
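One thing I notice (speculation on my part): the file has no `auth` section, so the autoscaler falls back to `ubuntu` as the SSH user, as seen in the log. That only works if `ami-08e2d37b6a0129927` is Ubuntu-based; for an Amazon Linux 2 AMI like the commented-out `ami-026b57f3c383c2eec`, the login user would instead need to be set explicitly, e.g.:

```yaml
auth:
  ssh_user: ec2-user  # default login user for Amazon Linux 2 AMIs
```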
Running `ray --version` gives `ray, version 2.0.1`. The Docker image `ray.cpu` is essentially just `rayproject/ray:2.0.1-py38-cpu`; I host it on AWS ECR.