Ray hangs in 2 different places, fails to launch anything on workers in ssh mode

mwlon · April 21, 2023, 9:44pm

Ray 2.3.1 (with [default])

I’m trying to launch a simple cluster with 1 worker I can ssh to:

cluster_name: default
provider:
    type: local
    head_ip: box1
    worker_ips: [box2]
auth:
    ssh_user:
min_workers: 1
upscaling_speed: 1.0
idle_timeout_minutes: 1.0
file_mounts: {}
cluster_synced_files:
file_mounts_sync_continuously: False
initialization_commands:
setup_commands:
head_setup_commands:
worker_setup_commands:
head_start_ray_commands:
    - source venv/bin/activate && ray stop && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --include-dashboard 1 --dashboard-host 0.0.0.0
worker_start_ray_commands:
    - date > /home//RAY_DID_A_THING # just to check if anything ran on the worker
    - source venv/bin/activate && ray stop && ray start --address=$RAY_HEAD_IP:6379

Note that venv is accessible on both machines.

I’m encountering the following issues, running on the head node:

When I try to run ray down -vvvvvvvv -y ray-config.yaml without first running ray stop, it hangs forever with no information:

$ ray down -vvvvvvvv -y ray-config.yaml
Loaded cached provider configuration from /tmp/ray-config-f218cd32e988d445b2bd4b3e6b2682e034aec662
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]

This is extra confusing because --no-config-cache is not an option for ray down!

When I run ray stop and then try ray up -vvvvvvvvv -y --no-config-cache ray-config.yaml again, it does not connect to the worker. I can confirm this by checking the dashboard, by checking that RAY_DID_A_THING does not exist on the worker, or by trying to run a job. If I run a job, it hangs and ultimately fails with:

TimeoutError: Placement group creation timed out after 100 seconds. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {...box1...}
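Another way to check which nodes actually joined, besides the dashboard, is to ask Ray from Python. The following is only a minimal sketch, assuming it is run on the head node while the cluster is up. The helper names alive_node_addresses and connected_nodes are hypothetical; the "Alive" and "NodeManagerAddress" keys are the ones that appear in the entries returned by ray.nodes().

```python
from typing import Any, Dict, List


def alive_node_addresses(nodes: List[Dict[str, Any]]) -> List[str]:
    """Return the sorted addresses of nodes that report themselves as alive.

    `nodes` is expected to have the shape of the list returned by
    `ray.nodes()`: one dict per node with an "Alive" flag and a
    "NodeManagerAddress" entry.
    """
    return sorted(
        node["NodeManagerAddress"] for node in nodes if node.get("Alive")
    )


def connected_nodes(address: str = "auto") -> List[str]:
    """Connect to a running cluster and list the alive node addresses.

    Hypothetical convenience wrapper; assumes a cluster is already running
    and reachable from this machine.
    """
    import ray  # imported lazily so the helper above works without Ray

    ray.init(address=address)
    try:
        return alive_node_addresses(ray.nodes())
    finally:
        ray.shutdown()
```

If the worker had joined, calling connected_nodes() on the head node should list two addresses; in the situation described here I would expect only the head node's address to come back.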

=============================

Am I doing something wrong? Might this be a bug with Ray?
