Worker Nodes Randomly Terminating on GCP Ray Cluster

1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.

2. Environment:

  • Ray version: 2.53.0
  • Python version: 3.13.11
  • OS: Ubuntu
  • Cloud/Infrastructure: GCP (Google Cloud)
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: When I submit jobs to the cluster, all jobs run to completion.
  • Actual: Jobs will randomly terminate. I can eventually run them by repeatedly resubmitting the job.

I have a config.yaml, which I post below, and I use `ray up config.yaml` to set up my cluster on GCP. The head node boots successfully. I then submit jobs with `ray job submit --address="$RAY_ADDRESS" --no-wait --entrypoint-num-gpus=1 -- bash -c "$CMD"`. I submit 11 jobs in total; a random subset of them starts running and the rest fail. I show the error logs, taken from the head node under /tmp/ray/session_latest/logs/monitor.log, further below. When I resubmit the failed jobs, some run while others fail again. Eventually I can get all jobs to run, but this defeats the purpose of using Ray. The error messages are very cryptic; perhaps a developer recognizes some of these and can point me in the right direction?
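A rough sketch of my submission loop, for reference (the entrypoint in $CMD and the shard argument are placeholders for my real training command):

# Sketch of how the 11 jobs are submitted; $CMD is a placeholder for the real entrypoint.
export RAY_ADDRESS="http://<head-node-ip>:8265"
for i in $(seq 0 10); do
  CMD="python train.py --shard $i"   # placeholder
  ray job submit --address="$RAY_ADDRESS" --no-wait --entrypoint-num-gpus=1 -- bash -c "$CMD"
done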

Config.yaml

# A unique identifier for the head node and workers of this cluster.
cluster_name: l4-cluster

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 48

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker: {}

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 10

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-central1
    availability_zone: us-central1-a
    project_id: seqnn-170614 # Globally unique project id
    cache_stopped_nodes: True

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: davidwang
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray_head_default:
        # The resources provided by this node type.
        resources: {"CPU": 2}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/seqnn-170614/global/images/ray-dldsc-py313-v1
                  #sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-2-7-cu128-ubuntu-2204-nvidia-570
                  #sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
                  

            # Additional options can be found in the compute docs at
            # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

            # If the network interface is specified as below in both head and worker
            # nodes, the manual network config is used.  Otherwise an existing subnet is
            # used.  To use a shared subnet, ask the subnet owner to grant permission
            # for 'compute.subnetworks.use' to the ray autoscaler account...
            # networkInterfaces:
            #   - kind: compute#networkInterface
            #     subnetwork: path/to/subnet
            #     aliasIpRanges: []
    ray_worker_small:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 48
        # The resources provided by this node type.
        resources: {"CPU": 32, "GPU": 1}
        # Provider-specific config for worker nodes of this type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: g2-standard-32
            scheduling:
              provisioningModel: STANDARD
              onHostMaintenance: TERMINATE
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100 # Node will not boot with under 100
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/seqnn-170614/global/images/ray-dldsc-py313-v1
                  #sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-2-7-cu128-ubuntu-2204-nvidia-570
                  #sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
              - boot: false
                source: projects/seqnn-170614/zones/us-central1-a/disks/dldsc-read-only
                autoDelete: false   # IMPORTANT: Don't delete my data when the worker dies!
                mode: READ_ONLY     # IMPORTANT: Allows multiple VMs to attach at once
                deviceName: dldsc_disk
              #- boot: false
              #  autoDelete: true
              #  type: SCRATCH
              #  interface: NVME
              #  initializeParams:
              #    diskType: projects/seqnn-170614/zones/us-central1-a/diskTypes/local-ssd
            # Run workers on preemptible instances by default.
            # Comment this out to use on-demand.
            #scheduling:
            #  - preemptible: true
            # Launch workers with the Service Account of the head node.
            serviceAccounts:
              - email: 1090276179925-compute@developer.gserviceaccount.com
                scopes:
                  - https://www.googleapis.com/auth/cloud-platform

    # Additional options can be found in the compute docs at
    # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
    "/home/davidwang/dldsc": "/home/davidwang/code_repos/dldsc",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []
#  - curl -fsSL https://get.docker.com -o get-docker.sh
#  - sudo sh get-docker.sh
#  - sudo usermod -aG docker $USER
#  - sudo systemctl restart docker -f

# List of shell commands to run to set up nodes.
setup_commands: []
  #- pip install -U "ray[default]"
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"


# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - pip install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands:
  #- sudo mkdir /mnt/local_ssd
  #- sudo mkfs.ext4 -F /dev/nvme0n1
  #- sudo mount /dev/nvme0n1 /mnt/local_ssd
  #- sudo chmod 777 /mnt/local_ssd
  - sudo mkdir -p /mnt/data
  - sudo mount -o ro,noload /dev/disk/by-id/google-dldsc_disk /mnt/data

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Below is monitor.log, isolating the lines for one worker (ray-l4-cluster-worker-1c19f7f0):

(base) davidwang@ray-l4-cluster-head-81b470aa-compute:~$ cat /tmp/ray/session_latest/logs/monitor.log | grep "ray-l4-cluster-worker-1c19f7f0"
2026-02-07 03:00:36,684 INFO instance_manager.py:263 -- Update instance REQUESTED->ALLOCATED (id=83e8f242-0a06-4479-bcbf-c0d609589b73, type=ray_worker_small, cloud_instance_id=, ray_id=): allocated unassigned cloud instance ray-l4-cluster-worker-1c19f7f0-compute
2026-02-07 03:00:36,725 INFO instance_manager.py:263 -- Update instance ALLOCATED->RAY_INSTALLING (id=83e8f242-0a06-4479-bcbf-c0d609589b73, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=): installing ray
2026-02-07 03:00:36,727 INFO ray_installer.py:42 -- Creating new (spawn_updater) updater thread for node ray-l4-cluster-worker-1c19f7f0-compute.
2026-02-07 03:00:36,771 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute?alt=json
2026-02-07 03:00:57,346 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:02:09,829 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:03:41,723 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:05:26,336 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:05:32,007 INFO threaded_ray_installer.py:81 -- Ray installation failed on instance ray-l4-cluster-worker-1c19f7f0-compute: Ray installation failed with unexpected status: setting-up
2026-02-07 03:05:32,020 INFO instance_manager.py:263 -- Update instance RAY_INSTALLING->RAY_RUNNING (id=83e8f242-0a06-4479-bcbf-c0d609589b73, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=): ray node f4a319781d2a79da0195f889a138e02c43a4e038b671ed752f850af6 is RUNNING
2026-02-07 03:05:42,008 INFO ray_installer.py:42 -- Creating new (spawn_updater) updater thread for node ray-l4-cluster-worker-1c19f7f0-compute.
2026-02-07 03:05:42,026 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute?alt=json
2026-02-07 03:07:17,116 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:07:46,233 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:07:51,820 INFO threaded_ray_installer.py:81 -- Ray installation failed on instance ray-l4-cluster-worker-1c19f7f0-compute: Ray installation failed with unexpected status: waiting-for-ssh
2026-02-07 03:07:51,822 INFO instance_manager.py:263 -- Update instance RAY_RUNNING->RAY_STOPPED (id=83e8f242-0a06-4479-bcbf-c0d609589b73, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=f4a319781d2a79da0195f889a138e02c43a4e038b671ed752f850af6): ray node f4a319781d2a79da0195f889a138e02c43a4e038b671ed752f850af6 is DEAD
2026-02-07 03:07:51,832 INFO instance_manager.py:263 -- Update instance RAY_STOPPED->TERMINATING (id=83e8f242-0a06-4479-bcbf-c0d609589b73, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=f4a319781d2a79da0195f889a138e02c43a4e038b671ed752f850af6): terminating instance from RAY_STOPPED
2026-02-07 03:07:51,834 INFO node_provider.py:184 -- NodeProvider: ray-l4-cluster-worker-1c19f7f0-compute: Terminating node
2026-02-07 03:07:51,850 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/stop?alt=json
2026-02-07 03:08:11,824 INFO ray_installer.py:42 -- Creating new (spawn_updater) updater thread for node ray-l4-cluster-worker-1c19f7f0-compute.
2026-02-07 03:08:11,838 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute?alt=json
2026-02-07 03:09:00,506 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute?alt=json
2026-02-07 03:09:00,784 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:09:01,413 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute?alt=json
2026-02-07 03:09:01,570 INFO instance_manager.py:263 -- Update instance TERMINATING->TERMINATED (id=83e8f242-0a06-4479-bcbf-c0d609589b73, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=f4a319781d2a79da0195f889a138e02c43a4e038b671ed752f850af6): cloud instance ray-l4-cluster-worker-1c19f7f0-compute no longer found
2026-02-07 03:09:01,571 INFO instance_manager.py:263 -- Update instance ALLOCATED->RAY_RUNNING (id=1725d5bf-95ff-4770-8326-60e5b29baa72, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=7edd216a9666384e1daea9bbc1a3ed64914941c12b54d2d084b962ed): ray node 7edd216a9666384e1daea9bbc1a3ed64914941c12b54d2d084b962ed is IDLE
2026-02-07 03:09:01,768 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute?alt=json
2026-02-07 03:09:01,936 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:09:02,420 INFO threaded_ray_installer.py:81 -- Ray installation failed on instance ray-l4-cluster-worker-1c19f7f0-compute: wait_ready aborting because node detected as terminated.
2026-02-07 03:10:00,055 INFO instance_manager.py:263 -- Update instance RAY_RUNNING->TERMINATED (id=1725d5bf-95ff-4770-8326-60e5b29baa72, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=7edd216a9666384e1daea9bbc1a3ed64914941c12b54d2d084b962ed): cloud instance ray-l4-cluster-worker-1c19f7f0-compute no longer found
2026-02-07 03:18:20,680 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/start?alt=json
2026-02-07 03:18:53,025 INFO instance_manager.py:263 -- Update instance REQUESTED->ALLOCATED (id=48278b4d-667d-4ab4-ae70-9192e7e79da6, type=ray_worker_small, cloud_instance_id=, ray_id=): allocated unassigned cloud instance ray-l4-cluster-worker-1c19f7f0-compute
2026-02-07 03:18:53,077 INFO instance_manager.py:263 -- Update instance ALLOCATED->RAY_INSTALLING (id=48278b4d-667d-4ab4-ae70-9192e7e79da6, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=): installing ray
2026-02-07 03:18:53,080 INFO ray_installer.py:42 -- Creating new (spawn_updater) updater thread for node ray-l4-cluster-worker-1c19f7f0-compute.
2026-02-07 03:18:53,212 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute?alt=json
2026-02-07 03:19:32,332 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:20:22,047 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/setLabels?alt=json
2026-02-07 03:20:41,642 INFO instance_manager.py:263 -- Update instance RAY_INSTALLING->RAY_RUNNING (id=48278b4d-667d-4ab4-ae70-9192e7e79da6, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=): ray node 3fde162a4386aa16a5746c1bf0318108b366e060371c5ba1606d79e0 is RUNNING
2026-02-07 03:32:07,731 INFO instance_manager.py:263 -- Update instance RAY_RUNNING->RAY_STOP_REQUESTED (id=48278b4d-667d-4ab4-ae70-9192e7e79da6, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=3fde162a4386aa16a5746c1bf0318108b366e060371c5ba1606d79e0): draining ray: idle for 604.971 secs > timeout=600 secs
2026-02-07 03:32:13,126 INFO instance_manager.py:263 -- Update instance RAY_STOP_REQUESTED->RAY_STOPPED (id=48278b4d-667d-4ab4-ae70-9192e7e79da6, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=3fde162a4386aa16a5746c1bf0318108b366e060371c5ba1606d79e0): ray node 3fde162a4386aa16a5746c1bf0318108b366e060371c5ba1606d79e0 is DEAD
2026-02-07 03:32:13,144 INFO instance_manager.py:263 -- Update instance RAY_STOPPED->TERMINATING (id=48278b4d-667d-4ab4-ae70-9192e7e79da6, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=3fde162a4386aa16a5746c1bf0318108b366e060371c5ba1606d79e0): terminating instance from RAY_STOPPED
2026-02-07 03:33:15,528 INFO node_provider.py:184 -- NodeProvider: ray-l4-cluster-worker-1c19f7f0-compute: Terminating node
2026-02-07 03:33:15,818 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/seqnn-170614/zones/us-central1-a/instances/ray-l4-cluster-worker-1c19f7f0-compute/stop?alt=json
2026-02-07 03:35:05,516 INFO instance_manager.py:263 -- Update instance TERMINATING->TERMINATED (id=48278b4d-667d-4ab4-ae70-9192e7e79da6, type=ray_worker_small, cloud_instance_id=ray-l4-cluster-worker-1c19f7f0-compute, ray_id=3fde162a4386aa16a5746c1bf0318108b366e060371c5ba1606d79e0): cloud instance ray-l4-cluster-worker-1c19f7f0-compute no longer found

Your logs show that Ray worker nodes are frequently failing during the RAY_INSTALLING or RAY_RUNNING phases, with messages like "Ray installation failed with unexpected status: setting-up" and "Ray installation failed with unexpected status: waiting-for-ssh". This pattern is consistent with underlying issues in node setup or SSH connectivity, which can cause jobs to randomly fail or require resubmission. These issues are commonly reported when using Ray's autoscaler on GCP, especially with custom images or when the worker setup is not fully reliable. The autoscaler expects the worker to reach a running state, but if setup or SSH fails, the node is marked as dead and terminated, as seen in your logs. This is not unique to your config and has been observed by others on GCP as well.

The root causes are often: (1) slow or unreliable SSH setup (e.g., due to network, image, or permissions issues), (2) custom images missing required Ray dependencies or SSH configuration, or (3) GCP API rate limits or transient errors. The Ray autoscaler is sensitive to these, and if a worker node does not become reachable or ready in time, it is killed and the pending work fails. To debug, check the worker node's /tmp/ray/session_latest/logs for more details, ensure your custom image has all Ray and SSH requirements, and consider increasing timeouts or using a more standard image as a test (a rough sketch of these checks follows below). For more robust operation, some users have found that using the latest Ray version with a minimal, official Ray image helps, and that pre-building images with all dependencies reduces setup failures. Would you like a step-by-step checklist to debug and resolve this?
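For example, a sketch of commands you could run while a worker is failing, assuming the instance name and zone from your log excerpt and Ray's default log locations:

# On the machine you run "ray up" from: tail the autoscaler output live
# instead of grepping monitor.log after the fact.
ray monitor config.yaml

# Check whether the worker VM is reachable over SSH at all -- this is what
# "waiting-for-ssh" is stuck on. Instance name and zone taken from the logs above.
gcloud compute ssh ray-l4-cluster-worker-1c19f7f0-compute --zone=us-central1-a

# Once on the worker: confirm the custom image has Ray on PATH, then look at the
# node-local Ray logs (default location) and, if the image uses the standard GCE
# guest environment, the startup-script output.
which ray && ray --version
ls /tmp/ray/session_latest/logs/
sudo journalctl -u google-startup-scripts --no-pager | tail -n 50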

@RunLLM Can you explain how to increase the timeout to solve this problem?

It seems likely that the "waiting-for-ssh" status is due to Ray's autoscaler timing out while waiting for SSH access to the worker node. However, there is no explicit user-configurable parameter in the Ray YAML or CLI to increase the SSH wait timeout for GCP clusters. The timeout is hardcoded in the Ray autoscaler code; several users have requested making it configurable, but as of now it is not exposed as a cluster config option or an environment variable.

You may be able to work around this by ensuring your nodes are faster to initialize (e.g., use smaller images, avoid slow setup commands), or by modifying the Ray source code directly to increase the timeout. If you want to try a code change, you would need to adjust the AUTOSCALER_NODE_START_WAIT_S constant in the Ray autoscaler Python files, but this is not officially documented or supported. Would you like more detail on how to locate and modify this timeout in the Ray source code?
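If you want to try that unsupported change, a minimal sketch, assuming the constant is still defined somewhere under ray/autoscaler/ in your installed version (the exact file and default value can differ between Ray releases):

# On the head node (where the autoscaler/monitor runs): locate the constant
# in the installed Ray package.
RAY_DIR=$(python -c "import ray, os; print(os.path.dirname(ray.__file__))")
grep -rn "AUTOSCALER_NODE_START_WAIT_S" "$RAY_DIR/autoscaler/"

# Edit the value in the file reported by grep (e.g. raise it well above your
# worker boot time), then restart Ray on the head node with the same command
# as head_start_ray_commands in your config so the monitor picks it up.
# This patches only your local install and is not officially supported.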
