Worker Nodes Randomly Terminating on GCP Ray Cluster

RunLLM · February 7, 2026, 4:11am

It seems likely that the “waiting-for-ssh” status means Ray’s autoscaler is timing out while waiting for SSH access to the worker node. Based on the available data sources, there is no explicit user-configurable parameter in the Ray cluster YAML or CLI to increase the SSH wait timeout for GCP clusters. The SSH timeout is hardcoded in the Ray autoscaler code. Several users have asked for it to be made configurable, but as of now it is not exposed as a cluster config option or environment variable (see here, here).

You may be able to work around this by making your nodes initialize faster (e.g., use smaller images, avoid slow setup commands), or by modifying the Ray source code directly to increase the timeout. If you want to try a code change, you would need to adjust the AUTOSCALER_NODE_START_WAIT_S constant in the Ray autoscaler Python files; note that this is not officially documented or supported. Would you like more detail on how to locate and modify this timeout in the Ray source code?
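As a starting point for locating the constant, here is a minimal sketch that searches an installed package for assignments to a given name. The helper names (find_constant, installed_package_dir) are hypothetical and not part of Ray, and the assumption that the constant lives somewhere under the package's autoscaler directory comes only from the description above, so verify it against your installed Ray version.

```python
import importlib.util
import os
import re


def find_constant(root, name):
    """Return (path, line_number, line) for each assignment to `name` under `root`."""
    # Match lines of the form "NAME = ..." (optionally indented).
    pattern = re.compile(r"^\s*" + re.escape(name) + r"\s*=")
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for filename in sorted(files):
            if not filename.endswith(".py"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as f:
                for number, line in enumerate(f, start=1):
                    if pattern.match(line):
                        hits.append((path, number, line.strip()))
    return hits


def installed_package_dir(package):
    """Return the install directory of `package`, or None if it is not installed."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    return list(spec.submodule_search_locations)[0]


if __name__ == "__main__":
    ray_dir = installed_package_dir("ray")
    if ray_dir is None:
        print("ray is not installed in this environment")
    else:
        # Assumption: the constant is defined under ray/autoscaler.
        autoscaler_dir = os.path.join(ray_dir, "autoscaler")
        for path, number, line in find_constant(
            autoscaler_dir, "AUTOSCALER_NODE_START_WAIT_S"
        ):
            print(f"{path}:{number}: {line}")
```

Run this on the head node, since that is where the autoscaler process runs. The printed line shows how the value is defined in your Ray version, which is worth checking before editing anything by hand.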

