Workers never initialize

Hello, I’m not sure if I’m doing things right with my auto-scaling configuration, but given the configuration below, no matter what I do, my workers get created but never join the cluster and remain “uninitialized”. As far as I can tell, the setup_commands / worker_setup_commands were never executed on the workers, which would explain why they don’t join the cluster. What would cause the commands not to run on the workers? For example, where can I find the log that shows whether setting up the workers actually worked? Any help greatly appreciated.

Here is my configuration:

cluster_name: ray-autoscaling-cluster

max_workers: 20
idle_timeout_minutes: 5
upscaling_speed: 1.0

provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1a
    cache_stopped_nodes: False

auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/sid_EC2.pem

head_node_type: ray.head.default

available_node_types:
    ray.head.default:
        resources: {"CPU": 2}
        node_config:
            InstanceType: m5.large
            KeyName: sid_EC2
            ImageId: ami-0fcdcdcc9cf0407ae
            TagSpecifications:
                - ResourceType: instance
                  Tags:
                      - Key: ray-node-type
                        Value: head
            SecurityGroupIds: # optional
            SubnetId: "" # optional

    ray.worker.cpu:
        min_workers: 1
        max_workers: 4
        resources: {"CPU": 2}
        node_config:
            InstanceType: m5.large
            KeyName: sid_EC2
            ImageId: ami-0fcdcdcc9cf0407ae
            TagSpecifications:
                - ResourceType: instance
                  Tags:
                      - Key: ray-node-type
                        Value: cpu

    ray.worker.gpu:
        min_workers: 0
        max_workers: 4
        resources: {"CPU": 2, "GPU": 1}
        node_config:
            InstanceType: g4dn.xlarge
            KeyName: sid_EC2
            ImageId: ami-0fcdcdcc9cf0407ae
            TagSpecifications:
                - ResourceType: instance
                  Tags:
                      - Key: ray-node-type
                        Value: gpu

setup_commands:
    - sudo apt-get update -y
    - sudo apt-get install -y python3 python3-pip

head_setup_commands:
    - sudo apt-get update -y
    - sudo apt-get install -y python3 python3-pip
    - pip3 install -U "ray[default,serve]" boto3

worker_setup_commands:
    - sudo apt-get update -y
    - sudo apt-get install -y python3 python3-pip
    - pip3 install -U "ray[default,serve]" boto3

head_start_ray_commands:
    - ray stop
    - ulimit -n 65536
    - ray start --head --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
    - serve start --http-host 0.0.0.0

worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379

file_mounts:
    ~/endpoints: ./endpoints
    ~/ray-isolated-environments: ./ray-isolated-environments
    ~/apps.yaml: ./apps.yaml

Severity: High. Completely blocks me.

Environment:

  • Ray version: 2.46
  • Python version: 3.12
  • OS: Ubuntu
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant):

Hi Dominic and welcome to the Ray community!
You can check the logs on the head node using ray logs cluster_monitor.log and ray logs cluster_monitor.err and look for errors. For workers, the logs are typically located in the /tmp/ray/session_latest/logs directory. Can you try to locate those for me and let me know what they say?

Hi Christina, thank you for your reply.
The logs unfortunately don’t give much information.
The command ray logs cluster_monitor.log returns this:

Node IP: < my ip >
— {}

And the log files’ contents:

> /tmp/ray/session_latest/logs/worker-…-ffffffff-14189.err
:job_id:01000000
:actor_name:ServeController

> /tmp/ray/session_latest/logs/worker-…-ffffffff-14189.out
:job_id:01000000
:actor_name:ServeController

> /tmp/ray/session_latest/logs/worker-…-ffffffff-14190.err
:job_id:01000000
:actor_name:ProxyActor
INFO 2025-06-03 23:12:14,869 proxy <my_ip> – Proxy starting on node 88f10dda6bb0025878ac7cba9871fa5d6fa6e6a5763e63e88b046e6a (HTTP port: 8000).
INFO 2025-06-03 23:12:14,928 proxy <my_ip> – Got updated endpoints: {}.

> /tmp/ray/session_latest/logs/worker-…-ffffffff-14190.out
:job_id:01000000
:actor_name:ProxyActor

Upon closer inspection, your logs don’t seem to show any output from the setup or worker setup commands, which means your issue must be happening before then. Do you know if your security setup on AWS allows SSH and opens all the ports the head and workers need to communicate properly? Are there any logs from the head node?
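If you want to rule that out, one thing you could try is pinning both node types to the same security group and subnet explicitly in node_config. This is only a sketch with placeholder IDs (the sg-… and subnet-… values below are hypothetical, substitute your own), merged into the config you already have:

available_node_types:
    ray.head.default:
        node_config:
            SecurityGroupIds: [sg-0123456789abcdef0] # placeholder: your security group
            SubnetId: subnet-0123456789abcdef0       # placeholder: your subnet
    ray.worker.cpu:
        node_config:
            SecurityGroupIds: [sg-0123456789abcdef0] # same group, so head and workers can reach each other
            SubnetId: subnet-0123456789abcdef0

The group itself still needs inbound rules allowing traffic from within the same group, at minimum SSH (22) and the port your workers connect to in worker_start_ray_commands (6379).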

Hi, here’s some more information.

  • Both the worker and head nodes are assigned to the same AWS security group, and so should be able to communicate with each other.
  • When I ray attach to the head node, I can SSH into the worker from the head node via the specified SSH key (sid_EC2); I’ve even set that key as the default key used by SSH on the head node.
  • When logged into the worker node, I can ping the head node.
  • So both nodes can “see” each other and can exchange all TCP / SSH traffic.

Observed
The worker nodes are properly started by the autoscaler, as reflected in the log:

tail -n 100 -f /tmp/ray/session_latest/logs/monitor*

Resources

Total Usage:
0.0/2.0 CPU
0B/4.61GiB memory
0B/1.98GiB object_store_memory

Total Constraints:
(no request_resources() constraints)
Total Demands:
(no resource demands)
2025-06-04 11:42:58,057 INFO autoscaler.py:463 – The autoscaler took 0.065 seconds to complete the update iteration.

==> /tmp/ray/session_latest/logs/monitor.out <==
2025-06-04 11:39:38,876 VINFO utils.py:149 – Creating AWS resource ec2 in us-east-1
2025-06-04 11:39:39,241 VINFO utils.py:149 – Creating AWS resource ec2 in us-east-1
2025-06-04 11:39:50,616 INFO node_provider.py:431 – Launched 2 nodes [subnet_id=subnet-XXXXXXd5cfe43e62d]
2025-06-04 11:39:50,617 INFO node_provider.py:449 – Launched instance i-XXXXXX14f7932baf4 [state=pending, info=pending]
2025-06-04 11:39:50,617 INFO node_provider.py:449 – Launched instance i-XXXXXXc2439e9616d [state=pending, info=pending]

But the workers never initialize and remain in “pending” mode forever. When I SSH into one of the workers, I can tell it has never run the setup_commands or worker_setup_commands at all.
Furthermore, there is no indication anywhere in the logs on the head node that an attempt was made to initialize the worker nodes, as if the head node never even tried to SSH into them and run commands.

Maybe there’s an option I can turn on to get more verbose output when Ray configures a worker?

You should be able to add -v to increase the verbosity of the logs, as specified here: Cluster Management CLI — Ray 2.46.0

Although it’s not mentioned in the docs, some folks have increased the verbosity by adding more v’s to the command, like this: [autoscaler] Worker node container is not removed after ray down? · Issue #11098 · ray-project/ray · GitHub

So maybe we can try ray up -vvvv or ray start -vvvv ?

Can you post your config YAML too? I’m wondering if it’s an issue with available_node_types or head_node_type, since the head node never even attempted to SSH into the workers.

I ended up debugging the Ray autoscaler code and found the problem:
Basically, my fault: the ray-node-type tag was set to “cpu” instead of “worker”. So the entire autoscaler logic was simply filtering out these instances, without logging any error, warning, or anything else.

ray.worker.cpu:
    min_workers: 0
    max_workers: 4
    resources: {"CPU": 2}
    node_config:
        InstanceType: m5.large
        KeyName: sid_EC2
        ImageId: ami-0fcXXXXXc9cf0407ae # Replace with your CPU-only AMI
        TagSpecifications:
            - ResourceType: instance
              Tags:
                  - Key: ray-node-type
                    Value: cpu # <<<<<<<<<<<<<<<<<<<< WRONG: must be set to "worker"
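
For reference, here’s the same block with just the tag value corrected (everything else unchanged):

ray.worker.cpu:
    min_workers: 0
    max_workers: 4
    resources: {"CPU": 2}
    node_config:
        InstanceType: m5.large
        KeyName: sid_EC2
        ImageId: ami-0fcXXXXXc9cf0407ae # Replace with your CPU-only AMI
        TagSpecifications:
            - ResourceType: instance
              Tags:
                  - Key: ray-node-type
                    Value: worker # the value the autoscaler expects for worker nodes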

In any case, thank you for your help and responsiveness!

Great job debugging! I’ve marked your answer as the solution; hopefully this helps anyone else running into this issue. 🙂