The “waiting-for-ssh” status most likely means Ray’s autoscaler is timing out while waiting for SSH access to the worker node. Based on the available data sources, there is no explicit user-configurable parameter in the Ray cluster YAML or CLI to increase the SSH wait timeout for GCP clusters. The SSH timeout appears to be hardcoded in the Ray autoscaler code. Several users have requested making it configurable, but as far as these sources show, it is not exposed as a cluster config option or environment variable (see here, here).
You may be able to work around this by making your nodes initialize faster (for example, use smaller images and avoid slow setup commands), or by modifying the Ray source code directly to increase the timeout. For a code change, you would need to adjust the AUTOSCALER_NODE_START_WAIT_S constant in the Ray autoscaler Python files. Note that this is not officially documented or supported, and a local patch has to be reapplied after every Ray upgrade. Would you like more detail on how to locate and modify this timeout in the Ray source code?
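As a starting point for locating the constant, here is a minimal sketch that scans an installed package for every line mentioning AUTOSCALER_NODE_START_WAIT_S. It uses only the Python standard library and does not import Ray. The helper names find_constant and installed_package_dir are hypothetical (they are not part of Ray), and the sketch makes no assumption about which file defines the constant or what its value is; it only reports what it finds in your installation.

```python
import importlib.util
from pathlib import Path


def find_constant(root, name="AUTOSCALER_NODE_START_WAIT_S"):
    """Return (path, line_number, stripped_line) for each line under root mentioning name."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # skip unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if name in line:
                hits.append((str(path), lineno, line.strip()))
    return hits


def installed_package_dir(package="ray"):
    """Return the install directory of a package, or None if it is not installed."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


if __name__ == "__main__":
    root = installed_package_dir("ray")
    if root is None:
        print("ray is not installed in this environment")
    else:
        for path, lineno, line in find_constant(root):
            print(f"{path}:{lineno}: {line}")
```

Run it on the machine that launches the cluster (the one where you run ray up), since that is where the autoscaler code waiting for SSH executes. The output lists both where the constant is defined and where it is used, which tells you which file to patch.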