Hi all!
I’m really hoping you can help me out here, as I have run out of ideas for what else to try.
I am trying to set up an AWS cluster that uses my own Docker image from the AWS registry. Everything appears to work: I can connect to the head normally and execute the script, and the autoscaler launches new workers as I would expect. Docker is installed via the initialization_commands field in the YAML file.
But for some reason the head cannot connect to the workers, or the workers aren’t set up correctly, and I’m not sure what’s wrong. I can see them ready in the AWS EC2 instance dashboard, but I don’t see them in the Ray dashboard.
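Besides the dashboard, the nodes that have actually joined can be listed from the head. Below is a minimal sketch, assuming it runs inside the container on the head and that `ray.nodes()` returns records with `NodeManagerAddress` and `Alive` keys. The helper names `summarize_nodes` and `nodes_from_ray` are made up for illustration.

```python
from typing import Any, Dict, List


def summarize_nodes(nodes: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group node records by liveness and return their IP addresses."""
    summary: Dict[str, List[str]] = {"alive": [], "dead": []}
    for node in nodes:
        # Fall back to "unknown" if a record has no address field.
        ip = node.get("NodeManagerAddress", "unknown")
        key = "alive" if node.get("Alive") else "dead"
        summary[key].append(ip)
    return summary


def nodes_from_ray() -> List[Dict[str, Any]]:
    """Fetch node records from the running cluster (hypothetical helper)."""
    import ray  # imported lazily so summarize_nodes works without Ray installed

    # address="auto" attaches to the Ray instance already running on this machine.
    ray.init(address="auto")
    return ray.nodes()
```

On the head, `print(summarize_nodes(nodes_from_ray()))` would show whether any worker IPs appear at all. If only the head's address is listed, the workers never ran `ray start` successfully.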
Additionally, when I connect to the head with ray attach, I am inside the Docker container, and if I ssh to the head instance, I see the Docker files and the container running. On the other hand, I can also ssh to the worker, but there I don’t see any Docker files or Docker containers. I expected the two to be in sync.
Here are two monitoring logs from the head node, one taken while the process was running and one taken after I killed the running task on the head (just in case it shows anything different):
I’d really appreciate some insight into the issue. Thanks!
Here is also my cluster file:
cluster_name: gem-base
max_workers: 4
upscaling_speed: 1.0
docker:
    image: "214830741341.dkr.ecr.eu-central-1.amazonaws.com/gem-base:latest"
    container_name: "ray_container"
    pull_before_run: True
    run_options: [] # Extra options to pass into "docker run"
idle_timeout_minutes: 30
provider:
    type: aws
    region: eu-central-1
    availability_zone: eu-central-1a,eu-central-1b,eu-central-1c
    cache_stopped_nodes: True # If not present, the default is True.
auth:
    ssh_user: ubuntu
available_node_types:
    ray.head.default:
        min_workers: 1
        max_workers: 1
        resources: {}
        node_config:
            InstanceType: m5.large
            ImageId: ami-05f7491af5eef733a
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 50
    ray.worker.default:
        min_workers: 1
        max_workers: 3
        resources: {}
        node_config:
            InstanceType: m5.xlarge
            ImageId: ami-05f7491af5eef733a
            InstanceMarketOptions:
                MarketType: spot
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 50
head_node_type: ray.head.default
file_mounts: {}
cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
initialization_commands:
    - curl -fsSL https://get.docker.com -o get-docker.sh
    - sudo sh get-docker.sh
    - sudo usermod -aG docker $USER
    - sudo systemctl restart docker -f
    - sudo apt install awscli -y
    - aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin 214830741341.dkr.ecr.eu-central-1.amazonaws.com
setup_commands: []
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
head_node: {}
worker_nodes: {}