AWS Nodes Stuck in Launching When Upscaling after Downscaling

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

I’m working with a cluster of on-demand nodes. I’ve noticed that when the cluster downscales, or when a node dies, the autoscaler seems unable to actually launch new nodes. ray monitor shows the workers as pending / launching indefinitely:

Healthy:
 1 ray_head
Pending:
 ray_gpu_worker, 5 launching
Recent failures:
 (no failures)

In the AWS console, I see the head node but there is absolutely nothing happening with the pending worker nodes.
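
For anyone who wants to detect this state from a script rather than by eye, here is a minimal sketch that pulls the "launching" counts out of a status summary. It assumes the text has exactly the layout pasted above (section headers ending in a colon, pending entries of the form "node_type, N launching"). The function name pending_launches is made up for illustration and is not part of Ray.

```python
import re


def pending_launches(status_text):
    """Return {node_type: count} for 'launching' entries under 'Pending:'.

    Assumes the plain-text layout shown in this thread; this is not a Ray API.
    """
    pending = {}
    in_pending = False
    for raw in status_text.splitlines():
        line = raw.strip()
        # Section headers such as "Healthy:" or "Pending:" end with a colon.
        if line.endswith(":"):
            in_pending = (line == "Pending:")
            continue
        if in_pending:
            match = re.fullmatch(r"(\S+),\s*(\d+)\s+launching", line)
            if match:
                pending[match.group(1)] = int(match.group(2))
    return pending


sample = """Healthy:
 1 ray_head
Pending:
 ray_gpu_worker, 5 launching
Recent failures:
 (no failures)
"""

print(pending_launches(sample))
```

If the same node type keeps showing a nonzero count across several polls while nothing appears in the AWS console, that matches the stuck state described here.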

It doesn’t appear to be an AWS setting, since I can launch a fresh cluster just fine, and the cluster autoscales up just fine initially. I also don’t know where to find more detailed logs to help diagnose this.
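
On the question of logs: as far as I know, the autoscaler on the head node writes to monitor.log, monitor.out and monitor.err under /tmp/ray/session_latest/logs. The sketch below filters those files for lines that look like errors. The directory, the keyword list, and the helper names suspicious_lines and scan_log are all assumptions for illustration, so adjust them for your setup.

```python
from pathlib import Path

# Assumed location of the autoscaler logs on the head node.
LOG_DIR = Path("/tmp/ray/session_latest/logs")
# Hypothetical keyword list; tune it to whatever the logs actually contain.
KEYWORDS = ("error", "exception", "traceback", "failed")


def suspicious_lines(lines, keywords=KEYWORDS):
    """Keep only lines containing one of the keywords (case-insensitive)."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(word in line.lower() for word in keywords)
    ]


def scan_log(path):
    """Return suspicious lines from one log file, or [] if it does not exist."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(errors="replace") as handle:
        return suspicious_lines(handle)


if __name__ == "__main__":
    # Run this on the head node, where the autoscaler (monitor) process lives.
    for name in ("monitor.log", "monitor.err", "monitor.out"):
        for line in scan_log(LOG_DIR / name):
            print(f"{name}: {line}")
```

Any matching lines would be useful to paste into a bug report alongside the status output.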

Hi @mdagost. This sounds like a bug.
Are you able to reproduce this behavior each time the cluster downscales a node?
Would you mind opening a bug report issue on the Ray GitHub and tagging me (@DmitriGekhtman) on the issue?

Ticket opened here: [Core] Autoscaled nodes stuck in launching status (Issue #27515, ray-project/ray on GitHub)