Launching or bringing down a bare metal cluster hangs indefinitely

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hello. I have the following cluster configuration:

cluster_name: distmat

provider:
    type: local
    head_ip: "192.168.128.129"
    worker_ips: ["192.168.128.211", "192.168.128.212", "192.168.128.213", "192.168.128.214", "192.168.128.215", "192.168.128.221", "192.168.128.222", "192.168.128.223", "192.168.128.224", "192.168.128.225"]
auth:
    ssh_user: <<user>>

upscaling_speed: 1.0

idle_timeout_minutes: 5

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379

I’ve been facing an issue where running ray up cluster.yaml, with or without the --no-config-cache flag, or ray down cluster.yaml, results in the command hanging indefinitely until I manually terminate it with Ctrl+C. Strangely, this problem emerged unexpectedly, as the same setup used to function correctly without issues.

If you have any insights or solutions to this problem, I’d be glad if you’d share them.
Thanks