Hello, I’m not sure whether I’ve set up my auto-scaling configuration correctly. With the configuration below, no matter what I do, my workers get created but never join the cluster and remain “uninitialized”. As far as I can tell, the setup_commands / worker_setup_commands are never executed on the workers, which would explain why they don’t join the cluster. What would cause these commands not to run on the workers? Any help is greatly appreciated. For example, where can I find the logs that show whether setting up the workers actually worked?
Here is my configuration:
cluster_name: ray-autoscaling-cluster
max_workers: 20
idle_timeout_minutes: 5
upscaling_speed: 1.0

provider:
  type: aws
  region: us-east-1
  availability_zone: us-east-1a
  cache_stopped_nodes: False

auth:
  ssh_user: ubuntu
  ssh_private_key: ~/.ssh/sid_EC2.pem

head_node_type: ray.head.default

available_node_types:
  ray.head.default:
    resources: {"CPU": 2}
    node_config:
      InstanceType: m5.large
      KeyName: sid_EC2
      ImageId: ami-0fcdcdcc9cf0407ae
      TagSpecifications:
        - ResourceType: instance
          Tags:
            - Key: ray-node-type
              Value: head
      SecurityGroupIds:  # optional
      SubnetId: ""  # optional

  ray.worker.cpu:
    min_workers: 1
    max_workers: 4
    resources: {"CPU": 2}
    node_config:
      InstanceType: m5.large
      KeyName: sid_EC2
      ImageId: ami-0fcdcdcc9cf0407ae
      TagSpecifications:
        - ResourceType: instance
          Tags:
            - Key: ray-node-type
              Value: cpu

  ray.worker.gpu:
    min_workers: 0
    max_workers: 4
    resources: {"CPU": 2, "GPU": 1}
    node_config:
      InstanceType: g4dn.xlarge
      KeyName: sid_EC2
      ImageId: ami-0fcdcdcc9cf0407ae
      TagSpecifications:
        - ResourceType: instance
          Tags:
            - Key: ray-node-type
              Value: gpu

setup_commands:
  - sudo apt-get update -y
  - sudo apt-get install -y python3 python3-pip

head_setup_commands:
  - sudo apt-get update -y
  - sudo apt-get install -y python3 python3-pip
  - pip3 install -U "ray[default,serve]" boto3

worker_setup_commands:
  - sudo apt-get update -y
  - sudo apt-get install -y python3 python3-pip
  - pip3 install -U "ray[default,serve]" boto3

head_start_ray_commands:
  - ray stop
  - ulimit -n 65536
  - ray start --head --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
  - serve start --http-host 0.0.0.0

worker_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379

file_mounts:
  ~/endpoints: ./endpoints
  ~/ray-isolated-environments: ./ray-isolated-environments
  ~/apps.yaml: ./apps.yaml
Severity of the issue: High (completely blocks me).
Environment:
- Ray version: 2.46
- Python version: 3.12
- OS: Ubuntu
- Cloud/Infrastructure: AWS
- Other libs/tools (if relevant):
Hi Dominic and welcome to the Ray community!
You can check the logs on the head node using ray logs cluster_monitor.log and ray logs cluster_monitor.err for errors. For workers, the logs are typically located in the /tmp/ray/session_latest/logs directory. Can you try to locate those for me and let me know what they say?
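For example, something like this on the head node (assuming the default log locations):

# Cluster launcher / autoscaler logs, fetched via the Ray CLI
ray logs cluster_monitor.log
ray logs cluster_monitor.err

# Raw per-session logs (the same directory layout exists on the workers)
ls /tmp/ray/session_latest/logs/
tail -n 200 /tmp/ray/session_latest/logs/monitor.out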
Hi Christina, thank you for your reply.
Unfortunately, the logs don’t give much information.
The command ray logs cluster_monitor.log returns this:
Node IP: < my ip >
— {}
…
And the content of the log files:
> /tmp/ray/session_latest/logs/worker-…-ffffffff-14189.err
:job_id:01000000
:actor_name:ServeController
> /tmp/ray/session_latest/logs/worker-…-ffffffff-14189.out
:job_id:01000000
:actor_name:ServeController
> /tmp/ray/session_latest/logs/worker-…-ffffffff-14190.err
:job_id:01000000
:actor_name:ProxyActor
INFO 2025-06-03 23:12:14,869 proxy <my_ip> – Proxy starting on node 88f10dda6bb0025878ac7cba9871fa5d6fa6e6a5763e63e88b046e6a (HTTP port: 8000).
INFO 2025-06-03 23:12:14,928 proxy <my_ip> – Got updated endpoints: {}.
> /tmp/ray/session_latest/logs/worker-…-ffffffff-14190.out
:job_id:01000000
:actor_name:ProxyActor
Upon closer inspection, your logs don’t show any output from the setup or worker setup commands, which means your issue must be happening before that point. Do you know if your security setup on AWS allows SSH and all the necessary ports for the head and workers to communicate properly? Are there any logs from the head node?
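For example, you can dump the inbound rules of the security group attached to your nodes with the AWS CLI and check that SSH (port 22) and traffic between the nodes are allowed (the group ID below is just a placeholder):

# Inspect the inbound rules of the security group used by the head and worker nodes
# (sg-0123456789abcdef0 is a placeholder for your actual group ID)
aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --query "SecurityGroups[].IpPermissions"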
Hi, here’s some more information.
- Both the worker and head nodes are assigned the same AWS security group, so they should be able to communicate with each other.
- When I ray attach to the head node, I can SSH into the worker from the head node via the specified SSH key (sid_EC2); I’ve even set that key as the default key used by SSH on the head node.
- When logged into the worker node, I can ping the head node.
- So both nodes can “see” each other and can exchange all TCP/SSH traffic (roughly the checks sketched below).
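The checks looked roughly like this (the IPs and the config filename are placeholders):

# From my laptop, attach to the head node (cluster.yaml stands in for my config file)
ray attach cluster.yaml

# From the head node, SSH into a worker using the same key
ssh -i ~/.ssh/sid_EC2.pem ubuntu@<worker-private-ip> 'echo reachable'

# From the worker, ping the head node
ping -c 3 <head-private-ip>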
Observed:
The worker nodes are properly started by the autoscaler, as reflected in the log:
tail -n 100 -f /tmp/ray/session_latest/logs/monitor*
Resources
Total Usage:
0.0/2.0 CPU
0B/4.61GiB memory
0B/1.98GiB object_store_memory
Total Constraints:
(no request_resources() constraints)
Total Demands:
(no resource demands)
2025-06-04 11:42:58,057 INFO autoscaler.py:463 – The autoscaler took 0.065 seconds to complete the update iteration.
==> /tmp/ray/session_latest/logs/monitor.out <==
2025-06-04 11:39:38,876 VINFO utils.py:149 – Creating AWS resource ec2
in us-east-1
2025-06-04 11:39:39,241 VINFO utils.py:149 – Creating AWS resource ec2
in us-east-1
2025-06-04 11:39:50,616 INFO node_provider.py:431 – Launched 2 nodes [subnet_id=subnet-XXXXXXd5cfe43e62d]
2025-06-04 11:39:50,617 INFO node_provider.py:449 – Launched instance i-XXXXXX14f7932baf4 [state=pending, info=pending]
2025-06-04 11:39:50,617 INFO node_provider.py:449 – Launched instance i-XXXXXXc2439e9616d [state=pending, info=pending]
But the workers never initialize and remain in “pending” mode forever. When I SSH into one of the workers, I can tell that the worker has never run the setup_commands or worker_setup_commands at all.
Furthermore, there is no indication anywhere in the logs on the head node that an attempt was made to initialize the worker nodes, as if the head node never even tried to SSH into them and run commands.
Maybe there’s an option I can turn on to get more verbose output when Ray configures a worker?
You should be able to add -v to add verbosity to the logs, as specified here: Cluster Management CLI — Ray 2.46.0
Although it’s not mentioned in the docs, some folks have increased the verbosity by adding more v’s to the command, like this: [autoscaler] Worker node container is not removed after ray down? · Issue #11098 · ray-project/ray · GitHub
So maybe we can try ray up -vvvv or ray start -vvvv?
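For instance (cluster.yaml is a placeholder for your config file):

# Relaunch / update the cluster with maximum verbosity
ray up -vvvv cluster.yaml

# Or add the flag when starting Ray on a node directly
ray start -vvvv --address=$RAY_HEAD_IP:6379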
Can you post your config YAML too? I’m wondering if it is an issue with available_node_types or head_node_type, given that the head node never even attempted to SSH into the workers.
I ended up debugging the Ray autoscaler code and found the problem:
Basically, my fault: the ray-node-type tag was set to “cpu” instead of “worker”. So the entire autoscaler logic was simply filtering out these instances without logging any errors, warnings, or anything.
ray.worker.cpu:
  min_workers: 0
  max_workers: 4
  resources: {"CPU": 2}
  node_config:
    InstanceType: m5.large
    KeyName: sid_EC2
    ImageId: ami-0fcXXXXXc9cf0407ae  # Replace with your CPU-only AMI
    TagSpecifications:
      - ResourceType: instance
        Tags:
          - Key: ray-node-type
            Value: cpu  # <<<<<<<<<<<<<<<<<<<< WRONG: must be set to "worker"
In any case, thank you for your help and responsiveness!
Great job debugging! I’ve marked your answer as the solution; hopefully this helps anyone else going through this issue.