Worker Nodes Randomly Terminating on GCP Ray Cluster

Your logs show Ray worker nodes repeatedly failing during the "RAY_INSTALLING" or "RAY_RUNNING" phases, with messages such as "Ray installation failed with unexpected status: setting-up" and "Ray installation failed with unexpected status: waiting-for-ssh". This pattern points to underlying problems in node setup or SSH connectivity, which can cause jobs to fail intermittently or require resubmission. Such failures are commonly reported when running Ray's autoscaler on GCP, especially with custom images or a worker setup that is not fully reliable. The autoscaler expects each worker to reach a "running" state; if setup or SSH fails, the node is marked dead and terminated, which matches what appears in your logs. This is not unique to your config and has been reported by other GCP users as well.
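To make the lifecycle above concrete, here is a minimal, hypothetical sketch (not Ray's actual implementation; the function name and status strings are illustrative, with the pending statuses taken from your error messages) of how an autoscaler-style loop gives a node a fixed window to become reachable before giving up on it:

```python
import time

# Illustrative pending statuses, mirroring the phases seen in the logs.
PENDING_STATUSES = {"waiting-for-ssh", "setting-up"}

def wait_for_node_ready(poll_status, timeout_s=60.0, interval_s=5.0):
    """Poll a node's status until it reports 'running' or the timeout expires.

    `poll_status` is a zero-argument callable returning the node's current
    status string. Returns True if the node became ready; False means the
    node would be marked dead and terminated (the behavior in your logs).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = poll_status()
        if status == "running":
            return True
        if status not in PENDING_STATUSES:
            # Unexpected status: give up immediately rather than keep waiting.
            return False
        time.sleep(interval_s)
    # Node never became ready within the window.
    return False
```

Under this model, a node stuck in "waiting-for-ssh" past the timeout is reported as dead, which is the termination pattern you are seeing.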

The root causes are typically: (1) slow or unreliable SSH setup (network, image, or permissions issues), (2) custom images missing required Ray dependencies or SSH configuration, or (3) GCP API rate limits or transient errors. The autoscaler is sensitive to all three: if a worker does not become reachable and ready in time, the node is killed and the job fails. To debug, inspect the worker node's /tmp/ray/session_latest/logs for more detail, verify that your custom image includes all Ray and SSH requirements, and try increasing timeouts or testing with a more standard image. For more robust operation, users have reported that running the latest Ray version with a minimal, official Ray image helps, and that pre-building images with all dependencies baked in reduces setup failures. Would you like a step-by-step checklist to debug and resolve this?
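As a starting point for the log check, here is a small, hypothetical helper (not part of Ray; the marker strings are taken from the error messages quoted above) that scans an iterable of log lines for known failure markers:

```python
# Failure markers taken from the error messages in the original logs.
FAILURE_MARKERS = ("waiting-for-ssh", "setting-up", "Ray installation failed")

def find_setup_failures(lines):
    """Return (line_number, line) pairs that mention a known failure marker.

    `lines` is any iterable of log lines, e.g. an open file from
    /tmp/ray/session_latest/logs on the affected node.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in FAILURE_MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Running it over a monitor log quickly surfaces when each worker hit the SSH or setup phase that preceded its termination, which helps correlate failures with GCP-side events.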

