Worker Nodes Randomly Terminating on GCP Ray Cluster

Your logs show Ray worker nodes repeatedly failing during the "RAY_INSTALLING" or "RAY_RUNNING" phases, with messages such as "Ray installation failed with unexpected status: setting-up" and "Ray installation failed with unexpected status: waiting-for-ssh". This pattern points to underlying problems in node setup or SSH connectivity, which can cause jobs to fail intermittently or require resubmission. Such failures are commonly reported when running Ray's autoscaler on GCP, especially with custom images or a worker setup that is not fully reliable. The autoscaler expects each worker to reach a "running" state; if setup or SSH fails, the node is marked dead and terminated, which matches what appears in your logs. This is not unique to your config and has been reported by other GCP users as well.
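To make the lifecycle above concrete, here is a minimal, hypothetical sketch (not Ray's actual implementation; the function name and status strings are illustrative, with the pending statuses taken from your error messages) of how an autoscaler-style loop gives a node a fixed window to become reachable before giving up on it:

```python
import time

# Illustrative pending statuses, mirroring the phases seen in the logs.
PENDING_STATUSES = {"waiting-for-ssh", "setting-up"}

def wait_for_node_ready(poll_status, timeout_s=60.0, interval_s=5.0):
    """Poll a node's status until it reports 'running' or the timeout expires.

    `poll_status` is a zero-argument callable returning the node's current
    status string. Returns True if the node became ready; False means the
    node would be marked dead and terminated (the behavior in your logs).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = poll_status()
        if status == "running":
            return True
        if status not in PENDING_STATUSES:
            # Unexpected status: give up immediately rather than keep waiting.
            return False
        time.sleep(interval_s)
    # Node never became ready within the window.
    return False
```

Under this model, a node stuck in "waiting-for-ssh" past the timeout is reported as dead, which is the termination pattern you are seeing.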

The root causes are typically: (1) slow or unreliable SSH setup (network, image, or permissions issues), (2) custom images missing required Ray dependencies or SSH configuration, or (3) GCP API rate limits or transient errors. The autoscaler is sensitive to all three: if a worker does not become reachable and ready in time, the node is killed and the job fails. To debug, inspect the worker node's /tmp/ray/session_latest/logs for more detail, verify that your custom image includes all Ray and SSH requirements, and try increasing timeouts or testing with a more standard image. For more robust operation, users have reported that running the latest Ray version with a minimal, official Ray image helps, and that pre-building images with all dependencies baked in reduces setup failures. Would you like a step-by-step checklist to debug and resolve this?
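As a starting point for the log check, here is a small, hypothetical helper (not part of Ray; the marker strings are taken from the error messages quoted above) that scans an iterable of log lines for known failure markers:

```python
# Failure markers taken from the error messages in the original logs.
FAILURE_MARKERS = ("waiting-for-ssh", "setting-up", "Ray installation failed")

def find_setup_failures(lines):
    """Return (line_number, line) pairs that mention a known failure marker.

    `lines` is any iterable of log lines, e.g. an open file from
    /tmp/ray/session_latest/logs on the affected node.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in FAILURE_MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Running it over a monitor log quickly surfaces when each worker hit the SSH or setup phase that preceded its termination, which helps correlate failures with GCP-side events.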

