How severely does this issue affect your experience of using Ray?
- Medium: It causes significant difficulty in completing my task, but I can work around it.
I’m working with a cluster of on-demand nodes. I’ve noticed that when the cluster downscales, or when a node dies, the autoscaler seems unable to actually launch new nodes.
ray monitor shows the workers as pending / launching indefinitely:
Healthy:
 1 ray_head
Pending:
 ray_gpu_worker, 5 launching
Recent failures:
 (no failures)
In the AWS console, I see the head node but there is absolutely nothing happening with the pending worker nodes.
It doesn’t seem to be an AWS setting, since I can launch a fresh cluster just fine, and the cluster autoscales up just fine. I also don’t know where to find more detailed logs to help diagnose this.