1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: 2.42.1
- Python version: 3.10
- OS: Ubuntu 22.04.5 LTS
- Cloud/Infrastructure: GCP
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected:
ray up config.yaml
should start the cluster.
- Actual:
ray up config.yaml
is stuck at waiting-for-ssh
config.yaml is fairly simple, provided below:
cluster_name: ds
max_workers: 2
upscaling_speed: 1.0
docker:
  image: rayproject/ray:latest-cpu
  container_name: "ray_container"
  pull_before_run: True
  run_options:
    - --ulimit nofile=65536:65536
idle_timeout_minutes: 5
provider:
  type: gcp
  region: us-central1
  availability_zone: us-central1-f
  project_id: xxx-xxx-xxx
auth:
  ssh_user: ubuntu
available_node_types:
  ray_head_default:
    resources: {"CPU": 2}
    node_config:
      machineType: n1-standard-2
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 100
            sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
      networkInterfaces:
        - kind: compute#networkInterface
          subnetwork: projects/xxx-xxx-xxx/regions/us-central1/subnetworks/xxx-xxx-01
          networkIP: "10.230.230.200"
  ray_worker_small:
    min_workers: 1
    max_workers: 2
    resources: {"CPU": 2}
    node_config:
      machineType: n1-standard-2
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 100
            sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
      scheduling:
        - preemptible: true
      networkInterfaces:
        - kind: compute#networkInterface
          subnetwork: projects/xxx-xxx-xxx/regions/us-central1/subnetworks/xxxx-xxxxxxx-01
head_node_type: ray_head_default
file_mounts: {}
cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
  - "**/.git"
  - "**/.git/**"
rsync_filter:
  - ".gitignore"
initialization_commands:
  - docker pull rayproject/ray:latest-cpu
setup_commands: []
head_setup_commands:
  - pip install google-api-python-client==1.7.8
worker_setup_commands: []
head_start_ray_commands:
  - ray stop
  - >-
    ray start
    --head
    --port=6379
    --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
  - ray stop
  - >-
    ray start
    --address=$RAY_HEAD_IP:6379
    --object-manager-port=8076
Commands run in sequence:
python -m pip install -U "ray[all]"==2.42.1
python -m pip install google-api-python-client==2.161.0
ray up config.yaml
I have tried Ray 2.45 as well, and also the same setup on Python 3.11. In all cases the provisioning is stuck at waiting-for-ssh.
I have tried the provisioning command on my local machine, which has Owner
permissions on the GCP project, and then on a GCP VM itself, which also has Owner
permissions, to keep things simple at this point.
Output of the command:
$ ray up config.yaml -vvvv --no-config-cache
Cluster: data-science-automation
2025-05-05 16:18:01,580 INFO util.py:382 -- setting max workers for head node type to 0
Checking GCP environment settings
2025-05-05 16:18:01,750 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-05-05 16:18:01,752 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-05-05 16:18:01,839 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-05-05 16:18:01,840 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-05-05 16:18:02,469 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-05-05 16:18:02,470 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-05-05 16:18:02,695 INFO config.py:650 -- _configure_key_pair: Private key not specified in config, using /home/biswalc/.ssh/ray-autoscaler_gcp_us-central1_xxx-xxx-xxx_ubuntu_0.pem
2025-05-05 16:18:02,760 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
2025-05-05 16:18:02,761 - WARNING - httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.
Updating cluster configuration and running full setup.
Cluster Ray runtime will be restarted. Confirm [y/N]: y
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
<1/1> Setting up head node
Prepared bootstrap config
2025-05-05 16:18:14,383 INFO node.py:349 -- wait_for_compute_zone_operation: Waiting for operation operation-xxx-xxx-xxx-c093126c to finish...
2025-05-05 16:18:19,628 INFO node.py:368 -- wait_for_compute_zone_operation: Operation operation-xxx-xxx-xxx-xxx finished.
New status: waiting-for-ssh
[1/7] Waiting for SSH to become available
Running `uptime` as a test.
Waiting for IP
Not yet available, retrying in 5 seconds
Not yet available, retrying in 5 seconds
Not yet available, retrying in 5 seconds
Not yet available, retrying in 5 seconds
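
Since the log loops on "Waiting for IP", here is a minimal sketch that pulls the network-related fields out of the config so they are easy to compare. The helper name `summarize_network` is hypothetical, and the choice of fields is an assumption: `accessConfigs` (the GCE field that attaches an external IP to an interface) and the `use_internal_ips` provider key are simply the settings that looked relevant to IP lookup. Whether they explain the hang is not established here.

```python
from typing import Any


def summarize_network(config: dict[str, Any]) -> dict[str, Any]:
    """Collect the network-related fields of a Ray cluster config dict."""
    provider = config.get("provider", {})
    summary: dict[str, Any] = {
        # Assumption: `use_internal_ips` is the provider key of interest here.
        "use_internal_ips": bool(provider.get("use_internal_ips", False)),
        "node_types": {},
    }
    for name, node_type in config.get("available_node_types", {}).items():
        interfaces = node_type.get("node_config", {}).get("networkInterfaces", [])
        summary["node_types"][name] = {
            "interfaces": len(interfaces),
            # True if any interface carries a non-empty accessConfigs entry.
            "has_access_configs": any(i.get("accessConfigs") for i in interfaces),
            # Statically assigned internal addresses, if any.
            "static_ips": [i["networkIP"] for i in interfaces if "networkIP" in i],
        }
    return summary


# Trimmed copy of the network settings from the config.yaml in this post.
example = {
    "provider": {"type": "gcp", "region": "us-central1"},
    "available_node_types": {
        "ray_head_default": {
            "node_config": {
                "networkInterfaces": [
                    {
                        "kind": "compute#networkInterface",
                        "subnetwork": "projects/xxx-xxx-xxx/regions/us-central1/subnetworks/xxx-xxx-01",
                        "networkIP": "10.230.230.200",
                    }
                ]
            }
        },
        "ray_worker_small": {
            "node_config": {
                "networkInterfaces": [
                    {
                        "kind": "compute#networkInterface",
                        "subnetwork": "projects/xxx-xxx-xxx/regions/us-central1/subnetworks/xxxx-xxxxxxx-01",
                    }
                ]
            }
        },
    },
}

for node_name, info in summarize_network(example)["node_types"].items():
    print(node_name, info)
```

To run it against the real file instead of the trimmed dict, the config could be loaded with PyYAML (`yaml.safe_load`), assuming that package is installed.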