Worker node fails to launch on AWS

1. Severity of the issue: (select one)

  • None: I’m just curious or want clarification.
  • Low: Annoying but doesn’t hinder my work.
  • Medium: Significantly affects my productivity but can find a workaround.
  • High: Completely blocks me.

2. Environment:

  • Ray version: 2.44.0
  • Python version: 3.11.11
  • OS:
  • Cloud/Infrastructure: AWS EC2 (g4dn.2xlarge)
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: The worker nodes launch successfully.
  • Actual: The worker nodes fail to launch, even though I can see the instances in the AWS console.

I am running the following cluster config; the commands I use to launch it are sketched just after it:

```yaml
cluster_name: ****

provider:
  type: aws
  region: ****

auth:
  ssh_user: ubuntu
  ssh_private_key: ******
  
docker:
    image: rayproject/ray:2.44.0-py311-gpu   
    container_name: "ray_container"
    pull_before_run: true
    run_options: 
        - --gpus=all
        - -w /home/ubuntu/app
        - --ulimit nofile=65536:65536

available_node_types:
  ray.head.default:
    resources: {}
    node_config:
      InstanceType: "g4dn.2xlarge"  
      KeyName: ******
      ImageId: "ami-06835d15c4de57810"  #regular nvidia gpu ami
      
  ray.worker.default:
    resources: {}
    min_workers: 1
    max_workers: 2
    node_config:
      InstanceType: g4dn.2xlarge
      KeyName: ******
      ImageId: "ami-06835d15c4de57810"

head_node_type: ray.head.default

max_workers: 2

file_mounts: {
  "/home/ubuntu/app": "./inference_code"
}

setup_commands:
  - docker container prune -f
  - docker image prune -af
  - docker system prune -af

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=:6379 --object-manager-port=8076

worker_setup_commands:
    - sudo usermod -aG docker ubuntu
    - newgrp docker

idle_timeout_minutes: 30
```
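For context, this is roughly how I bring the cluster up and tear it down; `cluster.yaml` below is just a placeholder for my actual config filename:

```bash
# Bring the cluster up (or apply config changes) using the config above.
# "cluster.yaml" is a placeholder for the real filename.
ray up cluster.yaml -y

# Tear everything down when finished.
ray down cluster.yaml -y
```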

I am not sure why, but my head node works just fine, and the worker node does get launched on the EC2 side. However, when I SSH into the worker, the ray_container does not exist, and in the Ray dashboard I see that it failed to launch. I am also struggling to find any logs explaining why it failed to launch.
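For reference, this is roughly what I am checking on the worker instance after SSHing in (just a sketch; the image tag is the one from my config above):

```bash
# On the worker EC2 instance, after SSHing in as ubuntu:
docker ps -a      # ray_container does not show up here at all
docker images     # check whether rayproject/ray:2.44.0-py311-gpu was even pulled
ls /tmp/ray       # check whether Ray ever created a session directory on this node
```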

Thanks!

Hi! Can you try SSHing into your head node and checking whether there’s anything at ~/ray_cluster_launcher.log or /tmp/ray/session_latest/logs/monitor.log? Those might have more detailed error messages explaining why the worker is failing to launch.
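Something like the following should surface them, assuming your local config file is called cluster.yaml (substitute your actual filename):

```bash
# Run these from the machine you ran `ray up` on; "cluster.yaml" is a
# placeholder for your actual cluster config filename.

# Stream the autoscaler logs (this tails monitor.log on the head node):
ray monitor cluster.yaml

# Or open a shell on the head node and look around yourself:
ray attach cluster.yaml
# ...then, inside that session:
#   tail -n 200 /tmp/ray/session_latest/logs/monitor.log
#   ls ~/ray_cluster_launcher.log
```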

Thanks for the pointer! Here is what I see in monitor.log on the head node:

```
Resources
---------------------------------------------------------------
Usage:
 3.5/8.0 CPU
 0.31000000000000005/1.0 GPU
 0B/21.34GiB memory
 0B/9.15GiB object_store_memory

Demands:
 (no resource demands)
2025-05-09 04:59:00,075 INFO autoscaler.py:461 -- The autoscaler took 0.098 seconds to complete the update iteration.
2025-05-09 04:59:05,182 INFO autoscaler.py:146 -- The autoscaler took 0.098 seconds to fetch the list of non-terminated nodes.
2025-05-09 04:59:05,182 INFO autoscaler.py:418 -- 
======== Autoscaler status: 2025-05-09 04:59:05.182786 ========
Node status
---------------------------------------------------------------
Active:
 1 ray.head.default
Pending:
 ray.worker.default, 1 launching
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 3.5/8.0 CPU
 0.31000000000000005/1.0 GPU
 0B/21.34GiB memory
 0B/9.15GiB object_store_memory

Demands:
 (no resource demands)
2025-05-09 04:59:05,183 INFO autoscaler.py:461 -- The autoscaler took 0.1 seconds to complete the update iteration.
2025-05-09 04:59:10,298 INFO autoscaler.py:146 -- The autoscaler took 0.106 seconds to fetch the list of non-terminated nodes.
2025-05-09 04:59:10,298 INFO autoscaler.py:418 -- 
======== Autoscaler status: 2025-05-09 04:59:10.298315 ========
Node status
---------------------------------------------------------------
Active:
 1 ray.head.default
Pending:
 ray.worker.default, 1 launching
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 3.5/8.0 CPU
 0.31000000000000005/1.0 GPU
 0B/21.34GiB memory
 0B/9.15GiB object_store_memory

Demands:
 (no resource demands)
2025-05-09 04:59:10,299 INFO autoscaler.py:461 -- The autoscaler took 0.107 seconds to complete the update iteration.
```

I also see this in the console when calling ray.init():

```
(autoscaler +7s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +7s) Adding 1 node(s) of type ray.worker.default.
```