Connection timeout when pulling docker image of head node

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m trying to create a Ray cluster on AWS, but it runs into errors when launching the head node. It looks like the head node is unable to pull the Docker image, and I can’t figure out why. For context, all my subnets are currently public, and I’ve attached a security group that allows all egress.

cluster_name: pro-staging
max_workers: 30
upscaling_speed: 30
docker:
  image: rayproject/ray:latest
  container_name: "ray_container"
  pull_before_run: True
  run_options:
    - --ulimit nofile=65536:65536
  head_image: rayproject/ray:latest
  worker_image: rayproject/ray-ml:latest-gpu
idle_timeout_minutes: 5
provider:
    type: aws
    region: eu-central-1
    security_group_ids:
      - sg-18dd3es021c5f13f4
    use_internal_ips: True
    cache_stopped_nodes: False
auth:
    ssh_user: ubuntu
available_node_types:
  ray.head.default:
    node_config:
      InstanceType: m5.4xlarge
      SubnetIds:
      - subnet-1234a1bfebaab1f11
      - subnet-2b4679e428c555553
      - subnet-33334dc30718ggg0e
    resources: {"CPU": 8, "memory": 8192}
  ray.worker.default:
    min_workers: 0
    max_workers: 120
    node_config:
      InstanceType: g5.xlarge
      SubnetIds:
      - subnet-1234a1bfebaab1f11
      - subnet-2b4679e428c555553
      - subnet-33334dc30718ggg0e
  ray.worker.g4dn_xlarge:
    min_workers: 0
    max_workers: 120
    node_config:
      InstanceType: g4dn.xlarge
      SubnetIds:
      - subnet-1234a1bfebaab1f11
      - subnet-2b4679e428c555553
      - subnet-33334dc30718ggg0e

head_node_type: ray.head.default

Complete log:

> ray up staging.yaml
Cluster: pro-staging
2023-04-30 02:45:56,755 INFO util.py:372 -- setting max workers for head node type to 0
Checking AWS environment settings
AWS config
  IAM Profile: ray-autoscaler-v1 [default]
  EC2 Key pair (all available node types): ray-autoscaler_eu-central-1 [default]
  VPC Subnets (all available node types): subnet-12e345678baab1f02, subnet-1729245267ce55033, subnet-1ce34dc79718f9d0e
  EC2 Security groups (all available node types): sg-0ea8c23334dc9d519 [default]
  EC2 AMI (all available node types): ami-0383bd0c1fc4c63ec [dlami]
No head node found. Launching a new cluster. Confirm [y/N]: y
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Acquiring an up-to-date head node
  Launched 1 nodes [subnet_id=subnet-12e345678baab1f02]
    Launched instance i-0646c847f124af8dd [state=pending, info=pending]
  Launched a new head node
  Fetching the new head node
<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Fetched IP: 171.12.11.11
ssh: connect to host 171.12.11.11 port 22: Operation timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 171.12.11.11 port 22: Operation timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 171.12.11.11 port 22: Operation timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 171.12.11.11 port 22: Operation timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 171.12.11.11 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 171.12.11.11 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 171.12.11.11 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added '171.12.11.11' (ED25519) to the list of known hosts.
 23:47:41 up 1 min,  1 user,  load average: 1.74, 0.52, 0.18
Shared connection to 171.12.11.11 closed.
    Success.
  Updating cluster configuration. [hash=ae847bdcc0d2424d338eaf4835bf8413714b23b7]
  New status: syncing-files
  [2/7] Processing file mounts
Shared connection to 171.12.11.11 closed.
Shared connection to 171.12.11.11 closed.
  [3/7] No worker file mounts to sync
  New status: setting-up
  [4/7] No initialization commands to run.
  [5/7] Initializing command runner
Shared connection to 171.12.11.11 closed.
Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Shared connection to 171.12.11.11 closed.
  New status: update-failed
  !!!
  SSH command failed.
  !!!
  Failed to setup head node.
Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded

The exact error message varies between attempts, but the underlying failure appears to be the same each time: the Docker daemon on the head node times out while connecting to registry-1.docker.io. Any pointers on what to investigate further would be very helpful.
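
One check that might help narrow this down is whether the head node can open an outbound TCP connection to the registry at all, independent of Docker. Below is a minimal sketch, assuming Python 3 is available on the head node's host (outside the container). The helper names `can_connect` and `describe` are hypothetical and not part of Ray or Docker. The host and port come from the URL in the error message above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers DNS resolution failures, refused connections, and timeouts.
        return False


def describe(host: str, port: int, timeout: float = 5.0) -> str:
    """Format the result for one endpoint as a single readable line."""
    status = "reachable" if can_connect(host, port, timeout) else "NOT reachable"
    return f"{host}:{port} is {status}"
```

Calling `print(describe("registry-1.docker.io", 443))` from a Python shell on the head node would show whether plain outbound HTTPS works. If it reports "NOT reachable", the problem is probably in the network path (routing, internet gateway, NAT, or public IP assignment) rather than in Docker or Ray. If it reports "reachable", the Docker daemon's own proxy or DNS settings would be the next thing to look at.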