1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.44.0
- Python version: 3.11.11
- OS:
- Cloud/Infrastructure: AWS EC2 g4dn.2xlarge
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected: The worker nodes launch successfully and join the cluster.
- Actual: The worker nodes fail to launch, even though I can see the instances running on AWS.
I am running the following cluster config:
cluster_name: ****
provider:
  type: aws
  region: ****
auth:
  ssh_user: ubuntu
  ssh_private_key: ******
docker:
  image: rayproject/ray:2.44.0-py311-gpu
  container_name: "ray_container"
  pull_before_run: true
  run_options:
    - --gpus=all
    - -w /home/ubuntu/app
    - --ulimit nofile=65536:65536
available_node_types:
  ray.head.default:
    resources: {}
    node_config:
      InstanceType: "g4dn.2xlarge"
      KeyName: ******
      ImageId: "ami-06835d15c4de57810" #regular nvidia gpu ami
  ray.worker.default:
    resources: {}
    min_workers: 1
    max_workers: 2
    node_config:
      InstanceType: g4dn.2xlarge
      KeyName: ******
      ImageId: "ami-06835d15c4de57810"
head_node_type: ray.head.default
max_workers: 2
file_mounts: {
  "/home/ubuntu/app": "./inference_code"
}
setup_commands:
  - docker container prune -f
  - docker image prune -af
  - docker system prune -af
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  - ray stop
  - ray start --address=:6379 --object-manager-port=8076
worker_setup_commands:
  - sudo usermod -aG docker ubuntu
  - newgrp docker
idle_timeout_minutes: 30
I am not sure why, but my head node works fine and the worker instance is launched in EC2. However, when I SSH into the worker, the ray_container does not exist, and the Ray dashboard shows that the node failed to launch. I am also struggling to find any logs that explain why it failed.
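For anyone trying to reproduce this, here is a rough sketch of where I understand the autoscaler output should normally end up. `cluster.yaml` is only a placeholder for my config file name, the function names are made up for illustration, and the log path assumes the default Ray session directory on the head node has not been changed:

```shell
# Placeholder: replace with the real path to the cluster config shown above.
CLUSTER_CONFIG="cluster.yaml"

# Print the last lines of the autoscaler (monitor) logs from the head node.
# Assumes the default /tmp/ray session directory.
tail_autoscaler_logs() {
  ray exec "$CLUSTER_CONFIG" 'tail -n 100 /tmp/ray/session_latest/logs/monitor*'
}

# Follow the autoscaler logs live while the worker is being set up.
follow_autoscaler_logs() {
  ray monitor "$CLUSTER_CONFIG"
}

# Run directly on the worker instance (after SSH-ing in) to list all
# containers, including stopped ones, and check whether ray_container
# was ever created.
list_worker_containers() {
  docker ps -a
}
```

If the setup or Docker commands for the worker fail, I would expect the error to show up in the monitor output rather than on the worker itself, but that is my assumption and not something I have confirmed.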
Thanks!