Unable to start Ray cluster in GCP VM

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.42.1
  • Python version: 3.10
  • OS: Ubuntu 22.04.5 LTS
  • Cloud/Infrastructure: GCP
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: ray up config.yaml should start the cluster.
  • Actual: ray up config.yaml is stuck at waiting-for-ssh.

config.yaml is fairly simple, provided below:

cluster_name: ds
max_workers: 2
upscaling_speed: 1.0
docker:
  image: rayproject/ray:latest-cpu
  container_name: "ray_container"
  pull_before_run: True
  run_options:
    - --ulimit nofile=65536:65536
idle_timeout_minutes: 5
provider:
    type: gcp
    region: us-central1
    availability_zone: us-central1-f
    project_id: xxx-xxx-xxx
auth:
    ssh_user: ubuntu
available_node_types:
    ray_head_default:
        resources: {"CPU": 2}
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
            networkInterfaces:
              - kind: compute#networkInterface
                subnetwork: projects/xxx-xxx-xxx/regions/us-central1/subnetworks/xxx-xxx-01
                networkIP: "10.230.230.200"
    ray_worker_small:
        min_workers: 1
        max_workers: 2
        resources: {"CPU": 2}
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
            scheduling:
              - preemptible: true
            networkInterfaces:
              - kind: compute#networkInterface
                subnetwork: projects/xxx-xxx-xxx/regions/us-central1/subnetworks/xxxx-xxxxxxx-01
head_node_type: ray_head_default
file_mounts: {}
cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
initialization_commands:
   - docker pull rayproject/ray:latest-cpu
setup_commands: []
head_setup_commands:
  - pip install google-api-python-client==1.7.8
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - >-
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - >-
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Commands run in sequence:

python -m pip install -U "ray[all]"==2.42.1
python -m pip install google-api-python-client==2.161.0
ray up config.yaml

I have tried Ray 2.45 as well, and also the same setup on Python 3.11. In all cases the provisioning is stuck at waiting-for-ssh.

I have tried the provisioning command on my local machine, which has Owner permissions on the GCP project, and then on the GCP VM itself, which also has Owner permissions, to keep things simple at this point.

Output of the command:

$ ray up config.yaml -vvvv --no-config-cache
Cluster: data-science-automation

2025-05-05 16:18:01,580	INFO util.py:382 -- setting max workers for head node type to 0
Checking GCP environment settings
2025-05-05 16:18:01,750 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-05-05 16:18:01,752 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-05-05 16:18:01,839 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-05-05 16:18:01,840 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-05-05 16:18:02,469 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-05-05 16:18:02,470 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-05-05 16:18:02,695	INFO config.py:650 -- _configure_key_pair: Private key not specified in config, using/home/biswalc/.ssh/ray-autoscaler_gcp_us-central1_xxx-xxx-xxx_ubuntu_0.pem
2025-05-05 16:18:02,760 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-05-05 16:18:02,761 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
Updating cluster configuration and running full setup.
Cluster Ray runtime will be restarted. Confirm [y/N]: y

Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

<1/1> Setting up head node
  Prepared bootstrap config
2025-05-05 16:18:14,383	INFO node.py:349 -- wait_for_compute_zone_operation: Waiting for operation operation-xxx-xxx-xxx-c093126c to finish...
2025-05-05 16:18:19,628	INFO node.py:368 -- wait_for_compute_zone_operation: Operation operation-xxx-xxx-xxx-xxx finished.
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Waiting for IP
      Not yet available, retrying in 5 seconds
      Not yet available, retrying in 5 seconds
      Not yet available, retrying in 5 seconds
      Not yet available, retrying in 5 seconds

Hello, can you confirm that the GCP API has a public IP + port that is accessible by Ray? Does your GCP config have an external / public address configured? It seems like Ray is struggling to reach / ssh into that endpoint.

Hello Christina, thank you for looking into this. No, the VM doesn't have a public IP; I want to use the internal/private IP.
In the config, I have fixed the internal IP of the head node at 10.230.230.200.

I am able to ssh into the head machine from the VM creating the cluster.

Oh I see! So if you're trying to use an internal/private IP, can you try enabling provider.use_internal_ips in your config? It is described under "Cluster YAML Configuration Options" in the Ray 2.45.0 docs. Maybe that could help? Then you just run Ray on a host that can reach the head node at 10.230.230.200.
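
For reference, a minimal sketch of that change, reusing the provider block from the config posted above. Only the use_internal_ips line is new; the other values are copied as-is, and whether this alone resolves the waiting-for-ssh hang is not confirmed in this thread.

```yaml
provider:
    type: gcp
    region: us-central1
    availability_zone: us-central1-f
    project_id: xxx-xxx-xxx
    # New line (suggested above): have the autoscaler connect to nodes over
    # their internal IPs instead of waiting for an external IP.
    use_internal_ips: true
```

With this set, ray up has to be run from a machine that can reach the private subnet, such as the GCP VM mentioned earlier that can already SSH to the head node.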

Thank you, this is a good resource.
