GCP Cluster Worker Nodes fail to Initialize

Vrushank_Desai · October 12, 2023, 1:46am

How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.

Hi, over the last few weeks I have been hitting an increasing number of errors while the Ray autoscaler spins up worker nodes on GCP. It has gotten to the point where I basically can’t scale clusters past a handful of worker nodes. I can reproduce this using an almost exact duplicate of the Ray example YAML for GCP and a script that runs dummy remote tasks, launching the cluster from fresh conda environments on the local node with Ray versions 2.3.1, 2.7.0, and 2.7.1. The consistency across Ray versions makes me suspect a recent GCP change. Things I have tried to fix this issue include:

- As mentioned, trying multiple Ray versions in new environments
- Different GCP instance types
- Different GCP instance images with different Python versions
- pip installing various versions of google-api-python-client in the cluster init commands

My monitor.err in all cases is full of exceptions related to read timeouts and SSL errors. It’s worth noting that a co-worker ran the same job on AWS with a similar config and had no issues scaling to 50 nodes in 6 minutes, while I can’t get past 5 nodes after more than an hour. This bug has made Ray unusable for me, so any help would be very much appreciated!

Logs from four different attempts - Ray Error Logs - Google Drive

YAML config -

Dummy Python payload that I ran on head node -

PaulFenton · November 11, 2023, 1:43pm

@Vrushank_Desai could you try filling in and uncommenting this section?

    # serviceAccounts:
    # - email: ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com
    #   scopes:
    #   - https://www.googleapis.com/auth/cloud-platform

I had a variety of issues following this full example to get things running on GCP and ended up building my config up from the smaller GCP example.
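
For reference, here is a sketch of what that section might look like once uncommented. It assumes the block sits inside the `node_config` of a worker node type, as it appears to in the Ray GCP example config, and that the indentation is adjusted to match the surrounding keys. The `<project_id>` placeholder is left as-is and has to be replaced with your own GCP project ID.

```yaml
# Assumed location: under a worker node type's node_config (not confirmed in this thread).
serviceAccounts:
# Replace <project_id> with your own GCP project ID.
- email: ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com
  scopes:
  - https://www.googleapis.com/auth/cloud-platform
```

Whether this resolves the read timeouts and SSL errors reported above is not confirmed here; it is only the filled-in form of the suggestion.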

denmarc · March 8, 2024, 2:30pm

Any development on this issue? I am facing the same problem. Ray Tune used to be very helpful but now it is quite useless on GCP.

denmarc · September 11, 2024, 9:55pm

I am still looking for a solution to this problem.

Sam_Chan · September 12, 2024, 3:49am

@denmarc which Ray version are you on, and can you share a repro script and some details about your setup on GCP?

denmarc · October 10, 2024, 8:00pm

Hi @Sam_Chan, I updated to the latest version (2.37.0) and the previous problem seems to be solved. Ray now launches and communicates with the worker nodes, but their initialization still seems to be faulty, as I reported here. It seems that worker commands are only being executed on the first one or two nodes.
