Ray up on AWS - unable to initialize workers

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.51.1
  • Python version: 3.10
  • OS: Ubuntu 22.04
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant): -

3. What happened vs. what you expected:

  • Expected:
    After a successful start of the head node, worker(s) should spawn
  • Actual:
    Head node starts successfully, but workers stay in “Pending” state until the autoscaler gives up, tears down the worker EC2 instance, and creates a new one. Rinse and repeat…

A worker sits in the pending state for a while, and is then replaced by a new worker that also stays pending. No failure is registered:

======== Autoscaler status: 2025-11-04 08:40:30.285220 ========
Node status
---------------------------------------------------------------
Active:
 (no active nodes)
Idle:
 1 ray.head.default
Pending:
 c2aa720c-7cdc-4e0c-bb87-9e0e04490aa9: ray.worker.default,
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/2.0 CPU
 0B/4.92GiB memory
 0B/2.11GiB object_store_memory

A few moments later, a new worker is pending:

======== Autoscaler status: 2025-11-04 08:43:39.105896 ========
Node status
---------------------------------------------------------------
Active:
 (no active nodes)
Idle:
 1 ray.head.default
Pending:
 21b87550-cac3-45a2-9083-c7d98c3f0707: ray.worker.default,
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/2.0 CPU
 0B/4.92GiB memory
 0B/2.11GiB object_store_memory

The monitor.log shows this is due to a failure to install Ray on the worker instance:
Ray installation failed with unexpected status: update-failed
It retries a few times, then gives up and spins up a new EC2 instance:

2025-11-04 08:41:08,141 INFO scheduler.py:1236 -- Adding 1 nodes to satisfy min count for node type: ray.worker.default.
2025-11-04 08:41:08,142 INFO event_logger.py:76 -- Adding 1 node(s) of type ray.worker.default.
2025-11-04 08:41:08,143 INFO instance_manager.py:247 -- New instance QUEUED (id=21b87550-cac3-45a2-9083-c7d98c3f0707, type=ray.worker.default, cloud_instance_id=, ray_id=): queuing new instance of ray.worker.default from scheduler
2025-11-04 08:41:08,143 INFO instance_manager.py:263 -- Update instance QUEUED->REQUESTED (id=21b87550-cac3-45a2-9083-c7d98c3f0707, type=ray.worker.default, cloud_instance_id=, ray_id=): requested to launch ray.worker.default with request id 77e7ba09-bb38-48f5-a32c-50ae>
2025-11-04 08:41:08,144 INFO instance_manager.py:263 -- Update instance RAY_INSTALL_FAILED->TERMINATING (id=c2aa720c-7cdc-4e0c-bb87-9e0e04490aa9, type=ray.worker.default, cloud_instance_id=i-0d6d9ac89f3a1e5a4, ray_id=): terminating instance from RAY_INSTALL_FAILED
2025-11-04 08:41:08,148 INFO node_provider.py:484 -- Launching 1 nodes of type ray.worker.default.
2025-11-04 08:41:09,179 INFO node_provider.py:488 -- Launched 1 nodes of type ray.worker.default.
2025-11-04 08:41:13,358 INFO instance_manager.py:263 -- Update instance REQUESTED->ALLOCATED (id=21b87550-cac3-45a2-9083-c7d98c3f0707, type=ray.worker.default, cloud_instance_id=, ray_id=): allocated unassigned cloud instance i-0daff3b3e6e521054
2025-11-04 08:41:13,358 INFO instance_manager.py:263 -- Update instance TERMINATING->TERMINATED (id=c2aa720c-7cdc-4e0c-bb87-9e0e04490aa9, type=ray.worker.default, cloud_instance_id=i-0d6d9ac89f3a1e5a4, ray_id=): cloud instance i-0d6d9ac89f3a1e5a4 no longer found
2025-11-04 08:41:13,361 INFO instance_manager.py:263 -- Update instance ALLOCATED->RAY_INSTALLING (id=21b87550-cac3-45a2-9083-c7d98c3f0707, type=ray.worker.default, cloud_instance_id=i-0daff3b3e6e521054, ray_id=): installing ray
2025-11-04 08:41:13,362 INFO config.py:266 -- Using global worker setup commands for ray.worker.default
2025-11-04 08:41:13,362 INFO ray_installer.py:42 -- Creating new (spawn_updater) updater thread for node i-0daff3b3e6e521054.
2025-11-04 08:41:13,362 INFO config.py:283 -- Using global initialization commands for ray.worker.default
2025-11-04 08:42:27,152 INFO threaded_ray_installer.py:81 -- Ray installation failed on instance i-0daff3b3e6e521054: Ray installation failed with unexpected status: update-failed
2025-11-04 08:42:27,153 WARNING threaded_ray_installer.py:86 -- Failed to install ray, retrying...

Looking into the monitor.out, the cause appears to be the host not being passed correctly to services.py (it seems to be empty):

2025-11-04 08:38:30,698 VINFO command_runner.py:386 -- Running `^[[1mexport RAY_OVERRIDE_RESOURCES='{"CPU":2,"memory":6012954214}';export RAY_HEAD_IP=10.1.1.57; export RAY_CLOUD_INSTANCE_ID=i-0d6d9ac89f3a1e5a4; export RAY_NODE_TYPE_NAME=ray.worker.default; export RAY_CL>
2025-11-04 08:38:30,699 VVINFO command_runner.py:388 -- Full command is `^[[1mssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax>
Did not find any active Ray processes.
^[[0m2025-11-04 08:38:32,083    VINFO command_runner.py:386 -- Running `^[[1mexport RAY_OVERRIDE_RESOURCES='{"CPU":2,"memory":6012954214}';export RAY_HEAD_IP=10.1.1.57; export RAY_CLOUD_INSTANCE_ID=i-0d6d9ac89f3a1e5a4; export RAY_NODE_TYPE_NAME=ray.worker.default; expor>
2025-11-04 08:38:32,083 VVINFO command_runner.py:388 -- Full command is `^[[1mssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax>
2025-11-04 08:38:33,083 ERROR services.py:538 -- Failed to convert :6379 to host:port
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ray/_private/services.py", line 536, in canonicalize_bootstrap_address
    bootstrap_host = resolve_ip_for_localhost(host)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ray/_private/services.py", line 613, in resolve_ip_for_localhost
    raise ValueError(f"Malformed host: {host}")
ValueError: Malformed host:
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2754, in main
    return cli()
  File "/usr/lib/python3/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ray/autoscaler/_private/cli_logger.py", line 823, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ray/scripts/scripts.py", line 1076, in start
    bootstrap_address = services.canonicalize_bootstrap_address(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ray/_private/services.py", line 536, in canonicalize_bootstrap_address
    bootstrap_host = resolve_ip_for_localhost(host)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ray/_private/services.py", line 613, in resolve_ip_for_localhost
    raise ValueError(f"Malformed host: {host}")
ValueError: Malformed host:
^[[0m2025-11-04 08:38:33,319    INFO log_timer.py:25 -- NodeUpdater: i-0d6d9ac89f3a1e5a4: Ray start commands failed [LogTimer=2621ms]
2025-11-04 08:38:33,319 INFO log_timer.py:25 -- NodeUpdater: i-0d6d9ac89f3a1e5a4: Applied config 66ab20367ca261831afcdaf2c2c39f2b79f69740  [LogTimer=153655ms]
2025-11-04 08:38:34,497 INFO log_timer.py:25 -- AWSNodeProvider: Set tag ray-node-status=update-failed on ['i-0d6d9ac89f3a1e5a4']  [LogTimer=176ms]

Below is my cluster_config.yaml:

# A unique identifier for the head node and workers of this cluster.
cluster_name: some-cluster

# Cloud-provider specific configuration.
provider:
    type: aws
    region: eu-west-1
    availability_zone: eu-west-1c,eu-west-1b,eu-west-1a
    cache_stopped_nodes: False
    use_internal_ips: True
    security_group:
      GroupName: some_sg

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 1

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    ssh_private_key: ~/keypair.pem

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g., instance type. By default
        # Ray auto-configures unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: t3a.large
            ImageId: ami-0b016d1e12e0375a8
            KeyName: keypair
            IamInstanceProfile:
                Arn: arn:aws:iam::000000000000:instance-profile/ray-autoscaler
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 32
                      VolumeType: gp3
                      Encrypted: false
            NetworkInterfaces:
                - DeviceIndex: 0
                  SubnetId: subnet-020fe4c15211bacdf
                  AssociatePublicIpAddress: True
                  Groups:
                      - sg-0f56a649020be7786
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of worker nodes of this type to launch.
        # This parameter takes precedence over min_workers.
        max_workers: 1
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g., instance type. By default
        # Ray auto-configures unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: t3a.large
            ImageId: ami-0b016d1e12e0375a8
            KeyName: keypair
            IamInstanceProfile:
                Arn: arn:aws:iam::000000000000:instance-profile/ray-autoscaler
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 32
                      VolumeType: gp3
                      Encrypted: false
            NetworkInterfaces:
                - DeviceIndex: 0
                  SubnetId: subnet-020fe4c15211bacdf
                  AssociatePublicIpAddress: True
                  Groups:
                      - sg-0f56a649020be7786

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is set up.
initialization_commands:
  - sudo apt update
  - sudo NEEDRESTART_MODE=a apt install -y python3-pip

# Default setup_commands:
setup_commands:
  - (stat $HOME/anaconda3/envs/tensorflow2_p310/ &> /dev/null && echo 'export PATH="$HOME/anaconda3/envs/tensorflow2_p310/bin:$PATH"' >> ~/.bashrc) || true
  - pip install -U --user ray[default]==2.51.1 boto3
  - export PATH="/home/$USER/.local/bin:$PATH"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0 --dashboard-port=80

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

It seems likely that your Ray worker nodes are failing to join the cluster because the ray start command on the workers is not receiving a valid head node address, resulting in the error “Malformed host:” and a failed Ray installation. This is supported by the traceback in your logs, which shows that the worker is trying to connect to :6379 (missing the host part), causing the autoscaler to repeatedly terminate and relaunch workers. This issue is often due to misconfiguration in the cluster YAML, environment variables, or network setup, which prevents the worker from resolving the head node’s IP address properly during startup.
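
To make that concrete, here is a minimal shell illustration (not the autoscaler's actual code path) of how an empty RAY_HEAD_IP produces exactly the address the traceback rejects:

# With RAY_HEAD_IP unset (or empty), the shell expands the variable to nothing:
unset RAY_HEAD_IP
echo "ray start --address=$RAY_HEAD_IP:6379"
# prints: ray start --address=:6379
# and ":6379" is the string canonicalize_bootstrap_address() refuses to parse,
# raising ValueError: Malformed host: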

My understanding is that the Ray autoscaler expects the $RAY_HEAD_IP environment variable to be set correctly so that the worker can connect to the head node using --address=$RAY_HEAD_IP:6379. If this variable is not set or is empty, the worker will fail to start Ray and the autoscaler will mark the node as “update-failed” and try again. You may want to check your YAML for any typos, ensure that the head node is accessible from the worker (network/firewall/subnet), and confirm that the Ray version and Python environment are consistent across nodes. Also, avoid using --user with pip install in setup_commands unless you are certain the PATH is set up correctly, as this can cause binaries to be installed in a location not on the default PATH, leading to command not found errors. The Ray documentation and default AWS config examples may provide further guidance on correct setup and environment variable handling (defaults.yaml example).
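
On the --user point, a minimal sanity check (pip's standard user scheme on Linux puts console scripts under ~/.local/bin):

pip install --user "ray[default]==2.51.1" boto3
# the ray entry point lands in ~/.local/bin, which must be on PATH:
command -v ray || export PATH="$HOME/.local/bin:$PATH"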

Would you like a step-by-step breakdown of how to debug and resolve this issue?

@RunLLM

I am aware the workers need RAY_HEAD_IP to be non-empty. Looking a few lines above the error, it does seem like the autoscaler correctly picked this up and does set the environment variable (export RAY_HEAD_IP=10.1.1.57):

VINFO command_runner.py:386 -- Running `^[[1mexport RAY_OVERRIDE_RESOURCES='{"CPU":2,"memory":6012954214}';export RAY_HEAD_IP=10.1.1.57; export RAY_CLOUD_INSTANCE_ID=i-0d6d9ac89f3a1e5a4; export RAY_NODE_TYPE_NAME=ray.worker.default; expor>

I can probably remove the --user flag when installing, although it didn’t cause issues before…

You’re correct that the autoscaler appears to set RAY_HEAD_IP (e.g., export RAY_HEAD_IP=10.1.1.57). However, the error Failed to convert :6379 to host:port suggests that, despite the export, the ray start --address=$RAY_HEAD_IP:6379 command is still receiving an empty value for $RAY_HEAD_IP at execution time. This can happen if the environment variable is not properly propagated to the shell where ray start runs, or if the command is executed in a way that doesn’t inherit the export (see with_envs logic).
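
As a rough sketch of that propagation point (illustrative only, not the autoscaler's actual implementation; "worker" is a placeholder hostname): an export only survives within the shell invocation that made it, so the first pattern below fails while the chained form works:

# Two separate ssh sessions: the export does not carry over to the second one.
ssh worker 'export RAY_HEAD_IP=10.1.1.57'
ssh worker 'ray start --address=$RAY_HEAD_IP:6379'   # expands to --address=:6379

# One session, export and command chained: the variable is visible to ray start.
ssh worker 'export RAY_HEAD_IP=10.1.1.57; ray start --address=$RAY_HEAD_IP:6379'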

Removing the --user flag is a good idea, but the core issue is likely with how the environment is set up for the worker start command. Double-check that your worker_start_ray_commands does not override or unset RAY_HEAD_IP, and that no shell quoting or YAML formatting issues are present. Also, ensure that your Ray version matches across nodes, as mismatches can cause subtle failures (see similar issues).
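
One quick way to verify this on a live cluster is to inspect the rendered bootstrap config on the head node (the autoscaler reads it from ~/ray_bootstrap_config.yaml, per the --autoscaling-config flag above) and check that the variable is still a literal, unexpanded string:

grep -n 'ray start --address' ~/ray_bootstrap_config.yaml
# expected: - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
# broken:   - ray start --address=:6379 --object-manager-port=8076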

Would you like detailed troubleshooting steps to further isolate the environment propagation issue?

We found the culprit after spending way too much time on this, and it turned out to be Terraform… We had implemented deploying the ray_cluster_config.yaml via user_data, which mangled the environment variables in the cluster config.

I only noticed they were actually missing while inspecting the ray_bootstrap_config.yaml on the head node; the command there was this:

ray start --address=:6379 --object-manager-port=8076

After properly escaping all $'s, it was working again…
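
For anyone hitting the same thing, a hypothetical minimal sketch of the failure, assuming the config is written from user_data via an unquoted shell heredoc (the actual Terraform code isn't shown here; paths and the config fragment are illustrative):

# BROKEN: an unquoted heredoc lets the boot-time shell expand $RAY_HEAD_IP,
# which is empty at that point, baking "--address=:6379" into the config:
cat <<EOF > /home/ubuntu/ray_cluster_config.yaml
worker_start_ray_commands:
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
EOF

# FIXED: escape the dollar sign (or quote the delimiter: <<'EOF') so the
# literal $RAY_HEAD_IP survives for the autoscaler to expand later:
cat <<EOF > /home/ubuntu/ray_cluster_config.yaml
worker_start_ray_commands:
    - ray start --address=\$RAY_HEAD_IP:6379 --object-manager-port=8076
EOF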

So, sorry for jumping to conclusions and blaming Ray, you guys are awesome :heart: