1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.51.1
- Python version: 3.10
- OS: Ubuntu 22.04
- Cloud/Infrastructure: AWS
- Other libs/tools (if relevant): -
3. What happened vs. what you expected:
- Expected:
  After a successful start of the head node, the worker(s) should spawn.
- Actual:
  The head node starts successfully, but the workers stay in “Pending” state until the autoscaler gives up, tears down the worker EC2 instance, and creates a new one. Rinse and repeat…
  The workers sit in the pending state for a while and are then replaced by a new worker that also stays pending. No failure is registered:
======== Autoscaler status: 2025-11-04 08:40:30.285220 ========
Node status
---------------------------------------------------------------
Active:
(no active nodes)
Idle:
1 ray.head.default
Pending:
c2aa720c-7cdc-4e0c-bb87-9e0e04490aa9: ray.worker.default,
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
0.0/2.0 CPU
0B/4.92GiB memory
0B/2.11GiB object_store_memory
A few moments later, a new worker is in the pending state:
======== Autoscaler status: 2025-11-04 08:43:39.105896 ========
Node status
---------------------------------------------------------------
Active:
(no active nodes)
Idle:
1 ray.head.default
Pending:
21b87550-cac3-45a2-9083-c7d98c3f0707: ray.worker.default,
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
0.0/2.0 CPU
0B/4.92GiB memory
0B/2.11GiB object_store_memory
The monitor.log shows this is caused by a failure to install Ray on the worker instance:
Ray installation failed with unexpected status: update-failed
It retries a few times, then gives up and spins up a new EC2 instance:
2025-11-04 08:41:08,141 INFO scheduler.py:1236 -- Adding 1 nodes to satisfy min count for node type: ray.worker.default.
2025-11-04 08:41:08,142 INFO event_logger.py:76 -- Adding 1 node(s) of type ray.worker.default.
2025-11-04 08:41:08,143 INFO instance_manager.py:247 -- New instance QUEUED (id=21b87550-cac3-45a2-9083-c7d98c3f0707, type=ray.worker.default, cloud_instance_id=, ray_id=): queuing new instance of ray.worker.default from scheduler
2025-11-04 08:41:08,143 INFO instance_manager.py:263 -- Update instance QUEUED->REQUESTED (id=21b87550-cac3-45a2-9083-c7d98c3f0707, type=ray.worker.default, cloud_instance_id=, ray_id=): requested to launch ray.worker.default with request id 77e7ba09-bb38-48f5-a32c-50ae>
2025-11-04 08:41:08,144 INFO instance_manager.py:263 -- Update instance RAY_INSTALL_FAILED->TERMINATING (id=c2aa720c-7cdc-4e0c-bb87-9e0e04490aa9, type=ray.worker.default, cloud_instance_id=i-0d6d9ac89f3a1e5a4, ray_id=): terminating instance from RAY_INSTALL_FAILED
2025-11-04 08:41:08,148 INFO node_provider.py:484 -- Launching 1 nodes of type ray.worker.default.
2025-11-04 08:41:09,179 INFO node_provider.py:488 -- Launched 1 nodes of type ray.worker.default.
2025-11-04 08:41:13,358 INFO instance_manager.py:263 -- Update instance REQUESTED->ALLOCATED (id=21b87550-cac3-45a2-9083-c7d98c3f0707, type=ray.worker.default, cloud_instance_id=, ray_id=): allocated unassigned cloud instance i-0daff3b3e6e521054
2025-11-04 08:41:13,358 INFO instance_manager.py:263 -- Update instance TERMINATING->TERMINATED (id=c2aa720c-7cdc-4e0c-bb87-9e0e04490aa9, type=ray.worker.default, cloud_instance_id=i-0d6d9ac89f3a1e5a4, ray_id=): cloud instance i-0d6d9ac89f3a1e5a4 no longer found
2025-11-04 08:41:13,361 INFO instance_manager.py:263 -- Update instance ALLOCATED->RAY_INSTALLING (id=21b87550-cac3-45a2-9083-c7d98c3f0707, type=ray.worker.default, cloud_instance_id=i-0daff3b3e6e521054, ray_id=): installing ray
2025-11-04 08:41:13,362 INFO config.py:266 -- Using global worker setup commands for ray.worker.default
2025-11-04 08:41:13,362 INFO ray_installer.py:42 -- Creating new (spawn_updater) updater thread for node i-0daff3b3e6e521054.
2025-11-04 08:41:13,362 INFO config.py:283 -- Using global initialization commands for ray.worker.default
2025-11-04 08:42:27,152 INFO threaded_ray_installer.py:81 -- Ray installation failed on instance i-0daff3b3e6e521054: Ray installation failed with unexpected status: update-failed
2025-11-04 08:42:27,153 WARNING threaded_ray_installer.py:86 -- Failed to install ray, retrying...
Looking into monitor.out, this appears to be caused by the host not being passed correctly to services.py (it seems to be empty):
2025-11-04 08:38:30,698 VINFO command_runner.py:386 -- Running `export RAY_OVERRIDE_RESOURCES='{"CPU":2,"memory":6012954214}';export RAY_HEAD_IP=10.1.1.57; export RAY_CLOUD_INSTANCE_ID=i-0d6d9ac89f3a1e5a4; export RAY_NODE_TYPE_NAME=ray.worker.default; export RAY_CL>
2025-11-04 08:38:30,699 VVINFO command_runner.py:388 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax>
Did not find any active Ray processes.
2025-11-04 08:38:32,083 VINFO command_runner.py:386 -- Running `export RAY_OVERRIDE_RESOURCES='{"CPU":2,"memory":6012954214}';export RAY_HEAD_IP=10.1.1.57; export RAY_CLOUD_INSTANCE_ID=i-0d6d9ac89f3a1e5a4; export RAY_NODE_TYPE_NAME=ray.worker.default; expor>
2025-11-04 08:38:32,083 VVINFO command_runner.py:388 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax>
2025-11-04 08:38:33,083 ERROR services.py:538 -- Failed to convert :6379 to host:port
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ray/_private/services.py", line 536, in canonicalize_bootstrap_address
    bootstrap_host = resolve_ip_for_localhost(host)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ray/_private/services.py", line 613, in resolve_ip_for_localhost
    raise ValueError(f"Malformed host: {host}")
ValueError: Malformed host:
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2754, in main
    return cli()
  File "/usr/lib/python3/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ray/autoscaler/_private/cli_logger.py", line 823, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ray/scripts/scripts.py", line 1076, in start
    bootstrap_address = services.canonicalize_bootstrap_address(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ray/_private/services.py", line 536, in canonicalize_bootstrap_address
    bootstrap_host = resolve_ip_for_localhost(host)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ray/_private/services.py", line 613, in resolve_ip_for_localhost
    raise ValueError(f"Malformed host: {host}")
ValueError: Malformed host:
2025-11-04 08:38:33,319 INFO log_timer.py:25 -- NodeUpdater: i-0d6d9ac89f3a1e5a4: Ray start commands failed [LogTimer=2621ms]
2025-11-04 08:38:33,319 INFO log_timer.py:25 -- NodeUpdater: i-0d6d9ac89f3a1e5a4: Applied config 66ab20367ca261831afcdaf2c2c39f2b79f69740 [LogTimer=153655ms]
2025-11-04 08:38:34,497 INFO log_timer.py:25 -- AWSNodeProvider: Set tag ray-node-status=update-failed on ['i-0d6d9ac89f3a1e5a4'] [LogTimer=176ms]
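To illustrate what I think is happening, here is a minimal sketch (my own simplified stand-in for the parsing step in the traceback, not Ray's actual code): although the command runner exports RAY_HEAD_IP=10.1.1.57, the worker's start command seems to expand it to an empty string, so `ray start` receives `--address=:6379` and the host part is empty:

```python
import os

# Assumption: simplified reproduction of the parsing failure, not Ray's implementation.
# In the failing case RAY_HEAD_IP appears to be empty when worker_start_ray_commands run.
ray_head_ip = os.environ.get("RAY_HEAD_IP", "")  # observed: ""

address = f"{ray_head_ip}:6379"        # becomes ":6379"
host, _, port = address.rpartition(":")

if not host:
    # Mirrors the "ValueError: Malformed host:" raised in services.py
    raise ValueError(f"Malformed host: {host}")
```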
Below is my cluster_config.yaml:
# A unique identifier for the head node and workers of this cluster.
cluster_name: some-cluster
# Cloud-provider specific configuration.
provider:
    type: aws
    region: eu-west-1
    availability_zone: eu-west-1c,eu-west-1b,eu-west-1a
    cache_stopped_nodes: False
    use_internal_ips: True
    security_group:
        GroupName: some_sg
# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 1
# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    ssh_private_key: ~/keypair.pem
# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default
# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g., instance type. By default
        # Ray auto-configures unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: t3a.large
            ImageId: ami-0b016d1e12e0375a8
            KeyName: keypair
            IamInstanceProfile:
                Arn: arn:aws:iam::000000000000:instance-profile/ray-autoscaler
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 32
                      VolumeType: gp3
                      Encrypted: false
            NetworkInterfaces:
                - DeviceIndex: 0
                  SubnetId: subnet-020fe4c15211bacdf
                  AssociatePublicIpAddress: True
                  Groups:
                      - sg-0f56a649020be7786
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of worker nodes of this type to launch.
        # This parameter takes precedence over min_workers.
        max_workers: 1
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g., instance type. By default
        # Ray auto-configures unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: t3a.large
            ImageId: ami-0b016d1e12e0375a8
            KeyName: keypair
            IamInstanceProfile:
                Arn: arn:aws:iam::000000000000:instance-profile/ray-autoscaler
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 32
                      VolumeType: gp3
                      Encrypted: false
            NetworkInterfaces:
                - DeviceIndex: 0
                  SubnetId: subnet-020fe4c15211bacdf
                  AssociatePublicIpAddress: True
                  Groups:
                      - sg-0f56a649020be7786
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {}
# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []
# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False
# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"
# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is set up.
initialization_commands:
    - sudo apt update
    - sudo NEEDRESTART_MODE=a apt install -y python3-pip
# Default setup_commands:
setup_commands:
    - (stat $HOME/anaconda3/envs/tensorflow2_p310/ &> /dev/null && echo 'export PATH="$HOME/anaconda3/envs/tensorflow2_p310/bin:$PATH"' >> ~/.bashrc) || true
    - pip install -U --user ray[default]==2.51.1 boto3
    - export PATH="/home/$USER/.local/bin:$PATH"
# Custom commands that will be run on the head node after common setup.
head_setup_commands: []
# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0 --dashboard-host=80
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
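In case it is useful, here is a tiny hypothetical check (not part of my config, purely illustrative) that could be run on a failing worker right before `ray start` to confirm whether RAY_HEAD_IP is actually set and non-empty in that environment:

```python
import os

# Hypothetical diagnostic: print whether RAY_HEAD_IP is exported and non-empty
# in the environment the worker start commands run in.
head_ip = os.environ.get("RAY_HEAD_IP")
print(f"RAY_HEAD_IP={head_ip!r}")  # expected '10.1.1.57', suspected '' or None
```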