Autoscaler spawns workers, but they aren't set up correctly and/or head cannot connect to them

Hi all!

I’m really hoping you can help me out here, as I’ve run out of ideas for what else to try…

I am trying to set up an AWS cluster that uses my own Docker image, pulled from the AWS registry (ECR). Most of it works and looks OK: I can connect to the head node normally and execute the script, and the autoscaler launches new workers as I would expect. Docker is installed via the initialization_commands field in the YAML file.

But for some reason either the head cannot connect to the workers or the workers aren’t set up correctly, and I’m not sure which. I can see them ready in the AWS EC2 instance dashboard, but they never show up in the Ray dashboard.

Additionally, when I connect to the head with ray attach, I am inside the Docker container, and if I ssh to the head instance, I see the Docker files and the container running. On the other hand, when I ssh to a worker, I don’t see any Docker files or containers there. I expected the head and the workers to be in sync.
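The checks described above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: `cluster.yaml`, the worker IP, and the SSH key path are placeholders rather than values from my setup, and `run`, `check_head`, and `check_worker` are hypothetical helper names. With `DRY_RUN=1` the commands are printed instead of executed.

```shell
#!/usr/bin/env bash
# Sketch only: CLUSTER_YAML is a placeholder path, not taken from the post.
CLUSTER_YAML="${CLUSTER_YAML:-cluster.yaml}"

# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# What the head node itself reports, as opposed to what EC2 shows.
check_head() {
    # Nodes that have actually registered with the Ray cluster.
    run ray exec "$CLUSTER_YAML" "ray status"
    # Recent autoscaler output (the same files that "ray monitor" tails).
    run ray exec "$CLUSTER_YAML" "tail -n 100 /tmp/ray/session_latest/logs/monitor*"
}

# Whether Docker is running on a worker and whether any container exists.
# Arguments (both assumptions): $1 = worker IP, $2 = path to the SSH private key.
check_worker() {
    worker_ip="${1:?usage: check_worker WORKER_IP SSH_KEY}"
    ssh_key="${2:?usage: check_worker WORKER_IP SSH_KEY}"
    run ssh -i "$ssh_key" "ubuntu@$worker_ip" \
        "systemctl is-active docker; docker ps -a; groups"
}
```

Running `DRY_RUN=1 check_head` first shows exactly what would be executed; dropping `DRY_RUN` runs the same commands against the cluster.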

Here are two monitoring logs from the head node, one taken while the process was running and one taken after I killed the running task on the head (just in case it would show anything different):

I’d really appreciate some insight into the issue. Thanks!

Here is also my cluster file:

cluster_name: gem-base

max_workers: 4

upscaling_speed: 1.0

docker:
    image: ""
    container_name: "ray_container"
    pull_before_run: True
    run_options: []  # Extra options to pass into "docker run"

idle_timeout_minutes: 30

provider:
    type: aws
    region: eu-central-1
    availability_zone: eu-central-1a,eu-central-1b,eu-central-1c
    cache_stopped_nodes: True # If not present, the default is True.

auth:
    ssh_user: ubuntu
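
available_node_types:
    ray.head.default:
        min_workers: 1
        max_workers: 1
        resources: {}
        node_config:
            InstanceType: m5.large
            ImageId: ami-05f7491af5eef733a
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 50
    ray.worker.default:
        min_workers: 1
        max_workers: 3
        resources: {}
        node_config:
            InstanceType: m5.xlarge
            ImageId: ami-05f7491af5eef733a
            InstanceMarketOptions:
                MarketType: spot
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 50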

head_node_type: ray.head.default

file_mounts: {}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands:
    - curl -fsSL -o
    - sudo sh
    - sudo usermod -aG docker $USER
    - sudo systemctl restart docker -f
    - sudo apt install awscli -y
    - aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin

setup_commands: []

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}