Your YAML config is structurally correct. Worker nodes failing to start is a common problem with Ray local (on-prem) clusters, and it usually comes down to one of three things: SSH connectivity issues, Docker not installed or not running on the worker nodes, or network/firewall problems. The Ray autoscaler relies on passwordless SSH from the head node to every worker node, and Docker must be available and accessible to the SSH user on each worker. Also make sure all required ports are open between nodes. Others using similar configs have reported the same behavior. Restarting the cluster several times, or manually starting Ray on the workers, sometimes resolves it temporarily, but neither is a permanent fix. The GitHub issue and forum thread listed under Sources have detailed discussions and troubleshooting steps.
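As a rough way to run the SSH and Docker checks described above from the head node, here is a minimal Python sketch. The user name, IP addresses, and key path are hypothetical placeholders, and the helper names (`ssh_command`, `check_worker`) are made up for illustration; they are not part of Ray. It only shells out to the standard `ssh` client with `BatchMode=yes` (so a password prompt counts as a failure) and runs `docker info` on the remote side.

```python
import subprocess


def ssh_command(user, host, remote_cmd, key_path=None, timeout_s=5):
    """Build a non-interactive ssh invocation.

    BatchMode=yes makes ssh fail instead of prompting for a password,
    which is what we want when testing passwordless access.
    """
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout_s}"]
    if key_path:
        cmd += ["-i", key_path]
    cmd += [f"{user}@{host}", remote_cmd]
    return cmd


def check_worker(user, host, key_path=None, run=subprocess.run):
    """Return {'ssh': bool, 'docker': bool} for one worker.

    'ssh' is True if a trivial remote command succeeds without a prompt.
    'docker' is True if `docker info` succeeds for that user on the worker.
    """
    results = {}
    for name, remote_cmd in (("ssh", "true"), ("docker", "docker info")):
        try:
            proc = run(
                ssh_command(user, host, remote_cmd, key_path),
                capture_output=True,
                text=True,
                timeout=30,
            )
            results[name] = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # ssh binary missing, or the connection hung
            results[name] = False
    return results


# Example (hypothetical values; substitute the worker_ips, ssh_user and
# ssh_private_key from your own cluster YAML):
#
#   for ip in ["192.168.1.11", "192.168.1.12"]:
#       print(ip, check_worker("ubuntu", ip, key_path="~/.ssh/id_rsa"))
```

If `ssh` comes back `False`, the autoscaler will not be able to reach that worker either. If `ssh` is `True` but `docker` is `False`, the Docker daemon is likely not running or the SSH user lacks permission to use it.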
If you want more detail on debugging steps (e.g., SSH checks, Docker status, log locations), let me know.
Sources:
- https://github.com/ray-project/ray/issues/39565
- https://discuss.ray.io/t/ray-cluster-worker-nodes-stuck-at-uninitialized/10533