How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.

I'm trying to create a Ray cluster on AWS, but launching the head node runs into errors. It looks like the head node cannot pull the Docker image, although I can't figure out why. For context, all my subnets are currently public, and the attached security group allows all egress.
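For reference, this is roughly how I verified the egress rules via the AWS CLI (the security group ID and region are the ones from my config below):

```shell
# Show the egress rules of the cluster's security group.
# sg ID and region taken from the cluster config; adjust if yours differ.
aws ec2 describe-security-groups \
  --region eu-central-1 \
  --group-ids sg-18dd3es021c5f13f4 \
  --query 'SecurityGroups[0].IpPermissionsEgress'
```

An all-egress rule should show up as a single entry with `"IpProtocol": "-1"` and `"CidrIp": "0.0.0.0/0"`.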
cluster_name: pro-staging
max_workers: 30
upscaling_speed: 30

docker:
  image: rayproject/ray:latest
  container_name: "ray_container"
  pull_before_run: True
  run_options:
    - --ulimit nofile=65536:65536
  head_image: rayproject/ray:latest
  worker_image: rayproject/ray-ml:latest-gpu

idle_timeout_minutes: 5

provider:
  type: aws
  region: eu-central-1
  security_group_ids:
    - sg-18dd3es021c5f13f4
  use_internal_ips: True
  cache_stopped_nodes: False

auth:
  ssh_user: ubuntu

available_node_types:
  ray.head.default:
    node_config:
      InstanceType: m5.4xlarge
      SubnetIds:
        - subnet-1234a1bfebaab1f11
        - subnet-2b4679e428c555553
        - subnet-33334dc30718ggg0e
    resources: {"CPU": 8, "memory": 8192}
  ray.worker.default:
    min_workers: 0
    max_workers: 120
    node_config:
      InstanceType: g5.xlarge
      SubnetIds:
        - subnet-1234a1bfebaab1f11
        - subnet-2b4679e428c555553
        - subnet-33334dc30718ggg0e
  ray.worker.g4dn_xlarge:
    min_workers: 0
    max_workers: 120
    node_config:
      InstanceType: g4dn.xlarge
      SubnetIds:
        - subnet-1234a1bfebaab1f11
        - subnet-2b4679e428c555553
        - subnet-33334dc30718ggg0e

head_node_type: ray.head.default
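One thing I can try while the instance is still up is SSHing in and reproducing the pull by hand. A sketch (the IP is the one fetched in the log below; the key path is a placeholder for wherever the `ray-autoscaler_eu-central-1` key pair is stored locally):

```shell
# Placeholders: substitute the head node's fetched IP and your local key path.
ssh -i ~/.ssh/ray-autoscaler_eu-central-1.pem ubuntu@171.12.11.11 <<'EOF'
# Can the instance reach Docker Hub's registry endpoint at all?
curl -sS --max-time 10 https://registry-1.docker.io/v2/ ; echo "curl exit: $?"
# Does a manual pull reproduce the timeout?
docker pull rayproject/ray:latest
EOF
```

If `curl` also times out, the problem is plain network reachability rather than anything Docker-specific.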
Complete log:
> ray up staging.yaml
Cluster: pro-staging
2023-04-30 02:45:56,755 INFO util.py:372 -- setting max workers for head node type to 0
Checking AWS environment settings
AWS config
IAM Profile: ray-autoscaler-v1 [default]
EC2 Key pair (all available node types): ray-autoscaler_eu-central-1 [default]
VPC Subnets (all available node types): subnet-12e345678baab1f02, subnet-1729245267ce55033, subnet-1ce34dc79718f9d0e
EC2 Security groups (all available node types): sg-0ea8c23334dc9d519 [default]
EC2 AMI (all available node types): ami-0383bd0c1fc4c63ec [dlami]
No head node found. Launching a new cluster. Confirm [y/N]: y
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Acquiring an up-to-date head node
Launched 1 nodes [subnet_id=subnet-12e345678baab1f02]
Launched instance i-0646c847f124af8dd [state=pending, info=pending]
Launched a new head node
Fetching the new head node
<1/1> Setting up head node
Prepared bootstrap config
New status: waiting-for-ssh
[1/7] Waiting for SSH to become available
Running `uptime` as a test.
Fetched IP: 171.12.11.11
ssh: connect to host 171.12.11.11 port 22: Operation timed out
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 171.12.11.11 port 22: Operation timed out
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 171.12.11.11 port 22: Operation timed out
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 171.12.11.11 port 22: Operation timed out
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 171.12.11.11 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 171.12.11.11 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 171.12.11.11 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added '171.12.11.11' (ED25519) to the list of known hosts.
23:47:41 up 1 min, 1 user, load average: 1.74, 0.52, 0.18
Shared connection to 171.12.11.11 closed.
Success.
Updating cluster configuration. [hash=ae847bdcc0d2424d338eaf4835bf8413714b23b7]
New status: syncing-files
[2/7] Processing file mounts
Shared connection to 171.12.11.11 closed.
Shared connection to 171.12.11.11 closed.
[3/7] No worker file mounts to sync
New status: setting-up
[4/7] No initialization commands to run.
[5/7] Initializing command runner
Shared connection to 171.12.11.11 closed.
Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Shared connection to 171.12.11.11 closed.
New status: update-failed
!!!
SSH command failed.
!!!
Failed to setup head node.
Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded
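In case it's relevant, what I'm planning to check next is whether the subnets actually auto-assign public IPs and route to an internet gateway (or NAT), since with `use_internal_ips: True` I wonder if the instance simply has no path to Docker Hub. A sketch, using the subnet IDs from my config:

```shell
# For each subnet from the cluster config: does it auto-assign public IPs,
# and does its route table contain a default route out of the VPC?
for s in subnet-1234a1bfebaab1f11 subnet-2b4679e428c555553 subnet-33334dc30718ggg0e; do
  aws ec2 describe-subnets --region eu-central-1 --subnet-ids "$s" \
    --query 'Subnets[0].MapPublicIpOnLaunch'
  aws ec2 describe-route-tables --region eu-central-1 \
    --filters "Name=association.subnet-id,Values=$s" \
    --query 'RouteTables[0].Routes'
done
```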
The exact error message varies between retries, but it looks like the same underlying failure each time. Any pointers on what to investigate further would be much appreciated.