How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
Hi, over the last few weeks I have seen an increasing number of errors when the Ray autoscaler spools up worker nodes on GCP. It has gotten to the point where I basically can't scale clusters past a handful of worker nodes. I can reproduce this with an almost exact duplicate of the Ray example YAML for GCP and a script that runs dummy remote tasks (sketched after the list below), across fresh conda envs on the local node launching the cluster, with Ray versions 2.3.1, 2.7.0, and 2.7.1. The consistency across Ray versions makes me suspect some sort of recent GCP-side change. Things I have tried to fix this issue include:
- as mentioned, trying multiple Ray versions in fresh environments
- different GCP instance types
- different GCP instance images with different Python versions
- pip installing various versions of google-api-python-client in the cluster init commands
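For concreteness, the dummy remote tasks I mentioned above are essentially of this shape (a minimal sketch; the task count and sleep duration here are illustrative placeholders, not the exact values from my payload):

```python
import time

import ray

ray.init(address="auto")

@ray.remote(num_cpus=1)
def dummy_task(i):
    # Sleep long enough that the backlog of pending tasks forces
    # the autoscaler to request additional worker nodes.
    time.sleep(60)
    return i

# Submit many more tasks than the head node can run at once so
# the cluster has to scale out.
results = ray.get([dummy_task.remote(i) for i in range(200)])
print(f"completed {len(results)} tasks")
```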
My monitor.err in all cases is full of exceptions related to read time-outs and SSL errors. It's worth noting that a co-worker ran the same job on AWS with a similar config and had no issues scaling to 50 nodes in 6 minutes, while I can't get past 5 nodes after more than an hour. This bug has made Ray unusable for me, so any help would be very much appreciated!
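Since the autoscaler talks to GCP through google-api-python-client, one sanity check I can think of (a sketch; the project and zone below are placeholders for my real values) is whether a bare Compute Engine list call from the launching node hits the same read time-outs / SSL errors outside of Ray:

```python
from googleapiclient import discovery

# The same kind of Compute Engine call the autoscaler's GCP node
# provider makes; project and zone are placeholders.
compute = discovery.build("compute", "v1")
resp = compute.instances().list(project="my-project", zone="us-west1-b").execute()
print(len(resp.get("items", [])))
```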
Logs from four different attempts - Ray Error Logs - Google Drive
YAML config -
Dummy Python payload that I ran on the head node -