GCP Cluster Worker Nodes fail to Initialize

How severe does this issue affect your experience of using Ray?
High: It blocks me to complete my task.

Hi, over the last few weeks I have been hitting an increasing number of errors while the Ray autoscaler spins up worker nodes on GCP. It has gotten to the point where I basically can’t scale clusters past a handful of worker nodes. I can reproduce this using a near-exact copy of the Ray example YAML for GCP and a script that runs dummy remote tasks, launching the cluster from fresh conda envs on my local machine with Ray versions 2.3.1, 2.7.0, and 2.7.1. The consistency across Ray versions makes me suspect a recent change on the GCP side. Things I have tried so far:

- as mentioned, trying multiple Ray versions in new environments
- different GCP instance types
- different GCP instance images with different Python versions
- pip installing various versions of google-api-python-client in the cluster init commands

In every case, my monitor.err is full of exceptions related to read timeouts and SSL errors. It’s worth noting that a co-worker ran the same job on AWS with a similar config and scaled to 50 nodes in 6 minutes without issues, while I can’t get past 5 nodes after more than an hour. This bug has made Ray unusable for me, so any help would be very much appreciated!

Logs from four different attempts: Ray Error Logs (Google Drive)

YAML config -

Dummy Python payload that I ran on head node -

@Vrushank_Desai could you try filling-in and un-commenting this section?:

    # serviceAccounts:
    # - email: ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com
    #   scopes:
    #   - https://www.googleapis.com/auth/cloud-platform

I had a variety of issues following this full example to get things running on GCP and ended up building my config up from the smaller GCP example.
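
For reference, here is a rough sketch of what the un-commented section could look like once filled in. It assumes the block sits under the worker node type's `node_config`, as in the Ray GCP example; the node type name `ray_worker_small` is only an assumption, so use whatever name your config defines, and replace `<project_id>` with your own project ID.

```yaml
available_node_types:
    ray_worker_small:   # assumed node type name; match the one in your config
        node_config:
            # machineType, disks, and other existing settings stay unchanged
            serviceAccounts:
            - email: ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com
              scopes:
              - https://www.googleapis.com/auth/cloud-platform
```

The list item and its `scopes` key need to line up exactly as shown after removing the `# ` prefixes, otherwise the YAML will not parse as intended.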

Any development on this issue? I am facing the same problem. Ray Tune used to be very helpful but now it is quite useless on GCP.

I am still looking for a solution to this problem.

@denmarc what Ray version are you on, and can you supply a repro script and some details on your GCP setup?

Hi @Sam_Chan, I updated to the latest version (2.37.0) and the previous problem seems to be solved: Ray now launches and communicates with the worker nodes. Their initialization still seems to be faulty, though, as I reported here. It seems the worker commands are only being executed on the first one or two nodes.