Ray hangs in 2 different places, fails to launch anything on workers in ssh mode

Ray 2.3.1 (installed with the [default] extra)

I’m trying to launch a simple cluster with 1 worker I can ssh to:

cluster_name: default
type: local
head_ip: box1
worker_ips: [box2]
min_workers: 1
upscaling_speed: 1.0
idle_timeout_minutes: 1.0
file_mounts: {}
file_mounts_sync_continuously: False
- source venv/bin/activate && ray stop && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --include-dashboard 1 --dashboard-host
- date > /home//RAY_DID_A_THING # just to check if anything ran on the worker
- source venv/bin/activate && ray stop && ray start --address=$RAY_HEAD_IP:6379

Note that venv is accessible on both machines.

I’m encountering the following issues when running commands on the head node:

  1. When I try to run ray down -vvvvvvvv -y ray-config.yaml without first running ray stop, it hangs forever with no information:

$ ray down -vvvvvvvv -y ray-config.yaml
Loaded cached provider configuration from /tmp/ray-config-f218cd32e988d445b2bd4b3e6b2682e034aec662
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]

This is extra confusing because --no-config-cache is not an option for ray down!

  2. When I run ray stop and then run ray up -vvvvvvvvv -y --no-config-cache ray-config.yaml again, it does not connect to the worker. I can confirm this by checking the dashboard, by checking that RAY_DID_A_THING does not exist on the worker, or by trying to run a job. If I run a job, it hangs and ultimately fails with:

TimeoutError: Placement group creation timed out after 100 seconds. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {...box1...}
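
The "did the worker join" check above can also be done programmatically from the head node. This is a minimal sketch, not part of the setup described in this post: the helper name alive_node_ips is hypothetical, and it assumes a Ray instance is already running on the machine so that ray.init(address="auto") can attach to it. Note that ray.nodes() reports IP addresses rather than hostnames like box1/box2.

```python
from typing import Any, Dict, List


def alive_node_ips(nodes: List[Dict[str, Any]]) -> List[str]:
    """Return the sorted IP addresses of nodes the cluster reports as alive.

    `nodes` is expected to look like the output of ray.nodes(): a list of
    dicts with at least "NodeManagerAddress" and "Alive" keys.
    """
    return sorted(
        node["NodeManagerAddress"]
        for node in nodes
        if node.get("Alive")
    )


def main() -> None:
    # Imported here so the helper above can be used without Ray installed.
    import ray

    # "auto" attaches to the Ray instance already running on this machine.
    ray.init(address="auto")

    ips = alive_node_ips(ray.nodes())
    print(f"{len(ips)} alive node(s): {ips}")

    # Aggregate resources across the cluster. If only the head node's
    # resources show up here, the worker never joined.
    print("cluster resources:", ray.cluster_resources())


if __name__ == "__main__":
    main()
```

If only the head node's IP is listed, that would match the TimeoutError above, where the available resources only mention box1.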


Am I doing something wrong, or might this be a bug in Ray?