Autoscaler spawns workers, but they aren't set up correctly and/or the head cannot connect to them

Hi all!

I’m really hoping you can help me out here, as I’ve run out of ideas for what else to try…

I am trying to set up a Ray cluster on AWS that uses my own Docker image from the AWS registry (ECR). Most of it looks OK: I can connect to the head node and run my script, and the autoscaler launches new workers as I would expect. Docker is installed via the initialization_commands field in the YAML file.

But for some reason either the head cannot connect to the workers or the workers aren’t set up correctly, and I’m not sure which. I can see them ready in the AWS EC2 instance dashboard, but they never show up in the Ray dashboard.
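For anyone trying to reproduce this, here is a rough sketch of how the head-side view can be checked from the machine that launched the cluster. The file name `cluster.yaml` and the helper name `head_view` are placeholders of mine. The log path is the default Ray session log location, so treat it as an assumption if your setup differs.

```shell
#!/bin/sh
# Run from the local machine that launched the cluster.
# CLUSTER_FILE is an assumed name for the cluster file shown in this post.
CLUSTER_FILE="${CLUSTER_FILE:-cluster.yaml}"

head_view() {
    # Nodes and resources as the autoscaler on the head currently sees them.
    ray exec "$CLUSTER_FILE" 'ray status'
    # Last lines of the autoscaler (monitor) logs on the head.
    ray exec "$CLUSTER_FILE" 'tail -n 100 /tmp/ray/session_latest/logs/monitor*'
}

# Only contact the cluster when explicitly asked: sh head_view.sh run
if [ "${1:-}" = "run" ]; then
    head_view
fi
```

If the workers are missing from the `ray status` output while EC2 reports them as running, the worker setup most likely never got as far as `ray start`.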

Additionally, when I connect to the head with ray attach, I land inside the Docker container, and if I ssh to the head instance directly, I can see the Docker files and the running container. On the workers, however, I can ssh in but I don’t see any Docker files or containers. I expected the workers to be set up the same way as the head.
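To narrow down how far the worker setup gets, something like the following could be run on a worker after ssh-ing in. It is only a sketch: the `check` and `worker_checks` helpers are hypothetical names, the container name comes from the cluster file below, and the AWS credentials check rests on the unverified assumption that the `docker login` step might need credentials the worker doesn't have.

```shell
#!/bin/sh
# Run on a worker instance (not inside a container).

check() {
    # Run a command quietly and report whether it succeeded; never abort.
    label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK: $label"
    else
        echo "FAILED: $label"
    fi
}

worker_checks() {
    check "docker binary installed"       command -v docker
    check "docker daemon reachable"       docker info
    check "aws cli installed"             command -v aws
    check "instance has AWS credentials"  aws sts get-caller-identity
    check "ray_container exists"          docker container inspect ray_container
}

# Only run the checks when explicitly asked: sh worker_checks.sh run
if [ "${1:-}" = "run" ]; then
    worker_checks
fi
```

The first FAILED line points at the initialization step that did not complete on that worker.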

Here are two monitoring logs from the head node, one captured while the process was running and one after I killed the running task on the head (just in case it shows anything different):

I’d really appreciate some insight into the issue. Thanks!

Here is also my cluster file:

cluster_name: gem-base

max_workers: 4

upscaling_speed: 1.0

docker:
    image: "214830741341.dkr.ecr.eu-central-1.amazonaws.com/gem-base:latest"
    container_name: "ray_container"
    pull_before_run: True
    run_options: []  # Extra options to pass into "docker run"

idle_timeout_minutes: 30

provider:
    type: aws
    region: eu-central-1
    availability_zone: eu-central-1a,eu-central-1b,eu-central-1c
    cache_stopped_nodes: True # If not present, the default is True.

auth:
    ssh_user: ubuntu
available_node_types:
    ray.head.default:
        min_workers: 1
        max_workers: 1
        resources: {}
        node_config:
            InstanceType: m5.large
            ImageId: ami-05f7491af5eef733a
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 50
    ray.worker.default:
        min_workers: 1
        max_workers: 3
        resources: {}
        node_config:
            InstanceType: m5.xlarge
            ImageId: ami-05f7491af5eef733a
            InstanceMarketOptions:
                MarketType: spot
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 50

head_node_type: ray.head.default

file_mounts: {}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands:
    - curl -fsSL https://get.docker.com -o get-docker.sh
    - sudo sh get-docker.sh
    - sudo usermod -aG docker $USER
    - sudo systemctl restart docker -f
    - sudo apt install awscli -y
    - aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin 214830741341.dkr.ecr.eu-central-1.amazonaws.com


setup_commands: []

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}