AWS Nodes Stuck in Launching When Upscaling after Downscaling

mdagost · August 2, 2022, 9:57pm

How severely does this issue affect your experience of using Ray?

Medium: It makes completing my task significantly more difficult, but I can work around it.

I’m working with a cluster of on-demand nodes. I’ve noticed that when the cluster downscales, or when a node dies, the autoscaler seems unable to actually launch new nodes. ray monitor shows the workers as pending / launching indefinitely:

Healthy:
 1 ray_head
Pending:
 ray_gpu_worker, 5 launching
Recent failures:
 (no failures)

In the AWS console, I see the head node but there is absolutely nothing happening with the pending worker nodes.

It’s not some AWS setting, since I can launch a fresh cluster just fine, and the initial autoscale-up works just fine too. I also don’t know where to find more detailed logs to help diagnose this.
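For anyone trying to detect this symptom automatically, here is a minimal sketch that pulls the "launching" counts out of status text. It assumes the text has exactly the layout quoted above (section headers ending in a colon, pending entries of the form "node_type, N launching"); the function name launching_counts is made up for illustration and is not part of Ray.

```python
import re

# Matches pending entries shaped like " ray_gpu_worker, 5 launching".
# Assumption: this mirrors the layout quoted in this post; other Ray
# versions may format the status output differently.
PENDING_RE = re.compile(r"^\s*(?P<node_type>[^,]+),\s*(?P<count>\d+)\s+launching\s*$")


def launching_counts(status_text):
    """Return {node_type: count} for entries listed as launching under 'Pending:'."""
    counts = {}
    in_pending = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.endswith(":"):
            # A section header such as "Healthy:" or "Recent failures:".
            in_pending = stripped == "Pending:"
            continue
        if in_pending:
            match = PENDING_RE.match(line)
            if match:
                counts[match.group("node_type").strip()] = int(match.group("count"))
    return counts


SAMPLE = """Healthy:
 1 ray_head
Pending:
 ray_gpu_worker, 5 launching
Recent failures:
 (no failures)
"""

print(launching_counts(SAMPLE))
```

Calling this on status snapshots taken a few minutes apart and seeing the same non-zero counts each time would match the "stuck in launching" behavior described here.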

Dmitri · August 3, 2022, 6:11pm

Hi @mdagost. This sounds like a bug.
Are you able to reproduce this behavior each time the cluster downscales a node?
Would you mind opening a bug report issue on the Ray GitHub and tagging me (@DmitriGekhtman) on the issue?
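
When gathering detail for a report like this, the autoscaler's own logs on the head node are usually the first place to look. Below is a rough sketch that scans them for suspicious lines. It assumes the default log location /tmp/ray/session_latest/logs and the usual monitor.log, monitor.out and monitor.err file names; the keyword list and both function names are illustrative choices, not anything defined by Ray.

```python
from pathlib import Path

# Default Ray log directory on the head node.
# Assumption: Ray was started without a custom temp directory.
LOG_DIR = Path("/tmp/ray/session_latest/logs")

# Illustrative keywords only; adjust them to whatever you are hunting for.
KEYWORDS = ("error", "exception", "traceback", "failed")


def find_suspicious_lines(text, keywords=KEYWORDS):
    """Return (line_number, line) pairs whose text contains any keyword, ignoring case."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line.rstrip()))
    return hits


def scan_monitor_logs(log_dir=LOG_DIR):
    """Scan every monitor* file in log_dir and return {file_name: hits}."""
    results = {}
    for path in sorted(Path(log_dir).glob("monitor*")):
        if path.is_file():
            results[path.name] = find_suspicious_lines(path.read_text(errors="replace"))
    return results


if __name__ == "__main__":
    for name, hits in scan_monitor_logs().items():
        print(f"== {name}: {len(hits)} suspicious line(s)")
        # Show only the most recent matches to keep the output short.
        for number, line in hits[-20:]:
            print(f"{number}: {line}")
```

Run it on the head node (for example after SSHing in), and paste any matching lines into the GitHub issue along with the cluster config.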

mdagost · August 4, 2022, 9:22pm

Ticket opened here: [Core] Autoscaled nodes stuck in launching status · Issue #27515 · ray-project/ray · GitHub
