Worker node fails to launch on AWS

1. Severity of the issue: (select one)

  • None: I’m just curious or want clarification.
  • Low: Annoying but doesn’t hinder my work.
  • Medium: Significantly affects my productivity but can find a workaround.
  • High: Completely blocks me.

2. Environment:

  • Ray version: 2.44.0
  • Python version: 3.11.11
  • OS:
  • Cloud/Infrastructure: AWS EC2 (g4dn.2xlarge)
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: The worker nodes launch successfully.
  • Actual: The worker nodes fail to launch, even though I can see the instances in the AWS console.

I am running the following cluster config; the commands I use to launch it are sketched just after it:

```yaml
cluster_name: ****

provider:
  type: aws
  region: ****

auth:
  ssh_user: ubuntu
  ssh_private_key: ******
  
docker:
    image: rayproject/ray:2.44.0-py311-gpu   
    container_name: "ray_container"
    pull_before_run: true
    run_options: 
        - --gpus=all
        - -w /home/ubuntu/app
        - --ulimit nofile=65536:65536

available_node_types:
  ray.head.default:
    resources: {}
    node_config:
      InstanceType: "g4dn.2xlarge"  
      KeyName: ******
      ImageId: "ami-06835d15c4de57810"  #regular nvidia gpu ami
      
  ray.worker.default:
    resources: {}
    min_workers: 1
    max_workers: 2
    node_config:
      InstanceType: g4dn.2xlarge
      KeyName: ******
      ImageId: "ami-06835d15c4de57810"

head_node_type: ray.head.default

max_workers: 2

file_mounts: {
  "/home/ubuntu/app": "./inference_code"
}

setup_commands:
  - docker container prune -f
  - docker image prune -af
  - docker system prune -af

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=:6379 --object-manager-port=8076

worker_setup_commands:
    - sudo usermod -aG docker ubuntu
    - newgrp docker

idle_timeout_minutes: 30
```
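For context, this is roughly how I bring the cluster up and tear it down; `cluster.yaml` below is just a placeholder for my actual config filename:

```bash
# Bring the cluster up (or apply config changes) using the config above.
# "cluster.yaml" is a placeholder for the real filename.
ray up cluster.yaml -y

# Tear everything down when finished.
ray down cluster.yaml -y
```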

I am not sure why, but my head node works just fine, and the worker node does get launched on the EC2 side. However, when I SSH into the worker, the ray_container does not exist, and in the Ray dashboard I see that it failed to launch. I am also struggling to find any logs explaining why it failed to launch.
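For reference, this is roughly what I am checking on the worker instance after SSHing in (just a sketch; the image tag is the one from my config above):

```bash
# On the worker EC2 instance, after SSHing in as ubuntu:
docker ps -a      # ray_container does not show up here at all
docker images     # check whether rayproject/ray:2.44.0-py311-gpu was even pulled
ls /tmp/ray       # check whether Ray ever created a session directory on this node
```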

Thanks!

Hi! Can you try SSHing into your head node and checking whether there’s anything at ~/ray_cluster_launcher.log or /tmp/ray/session_latest/logs/monitor.log? Those might have more detailed error messages explaining why the worker is failing to launch.
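Something like the following should surface them, assuming your local config file is called cluster.yaml (substitute your actual filename):

```bash
# Run these from the machine you ran `ray up` on; "cluster.yaml" is a
# placeholder for your actual cluster config filename.

# Stream the autoscaler logs (this tails monitor.log on the head node):
ray monitor cluster.yaml

# Or open a shell on the head node and look around yourself:
ray attach cluster.yaml
# ...then, inside that session:
#   tail -n 200 /tmp/ray/session_latest/logs/monitor.log
#   ls ~/ray_cluster_launcher.log
```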

Thanks for the pointer! Here is what I see in monitor.log on the head node:

```
Resources
---------------------------------------------------------------
Usage:
 3.5/8.0 CPU
 0.31000000000000005/1.0 GPU
 0B/21.34GiB memory
 0B/9.15GiB object_store_memory

Demands:
 (no resource demands)
2025-05-09 04:59:00,075 INFO autoscaler.py:461 -- The autoscaler took 0.098 seconds to complete the update iteration.
2025-05-09 04:59:05,182 INFO autoscaler.py:146 -- The autoscaler took 0.098 seconds to fetch the list of non-terminated nodes.
2025-05-09 04:59:05,182 INFO autoscaler.py:418 -- 
======== Autoscaler status: 2025-05-09 04:59:05.182786 ========
Node status
---------------------------------------------------------------
Active:
 1 ray.head.default
Pending:
 ray.worker.default, 1 launching
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 3.5/8.0 CPU
 0.31000000000000005/1.0 GPU
 0B/21.34GiB memory
 0B/9.15GiB object_store_memory

Demands:
 (no resource demands)
2025-05-09 04:59:05,183 INFO autoscaler.py:461 -- The autoscaler took 0.1 seconds to complete the update iteration.
2025-05-09 04:59:10,298 INFO autoscaler.py:146 -- The autoscaler took 0.106 seconds to fetch the list of non-terminated nodes.
2025-05-09 04:59:10,298 INFO autoscaler.py:418 -- 
======== Autoscaler status: 2025-05-09 04:59:10.298315 ========
Node status
---------------------------------------------------------------
Active:
 1 ray.head.default
Pending:
 ray.worker.default, 1 launching
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 3.5/8.0 CPU
 0.31000000000000005/1.0 GPU
 0B/21.34GiB memory
 0B/9.15GiB object_store_memory

Demands:
 (no resource demands)
2025-05-09 04:59:10,299 INFO autoscaler.py:461 -- The autoscaler took 0.107 seconds to complete the update iteration.
```

I also see this in the console when calling ray.init():

```
(autoscaler +7s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +7s) Adding 1 node(s) of type ray.worker.default.
```