(launch failed) Worker nodes on AWS get to setup and stall, won't shutdown via ray

John_Just · October 8, 2023, 5:14am

High: It blocks me to complete my task.

I’m pretty sure this is a security ports issue…but need confirmation/help.

I’m using AWS and my organization requires security groups with restrictive requirements. For any ports besides like port 22 and a couple others (443, 80, etc), I’m restricted in creating security group ingress/inbound rules in that I have to specify the exact IP address of the source.

Trying to use autoscalar. I start ray with “ray up …” and up comes a head node. I have a security group on the head node that allows the computer I’m working from to read from any port on the head node.

When I run an example script that spins up worker nodes, they get fired up and connecting via port 22 ssh (from head node) all goes fine. Then comes setup and they get stuck (as I’m watching the dashboard). Eventually I get this output…

“Removing 2 nodes of type ray.worker.default (launch failed).”

I never actually see these nodes doing any work on the dashboard (in the cluster tab they never show up…only in the overview tab), and they don’t terminate automatically either even though Ray says the launch failed.

Do these worker nodes need to be able to read from any port on the Ray head node (thus I’d need an inbound security rule for each worker)…or what else might be going on here?

John_Just · October 8, 2023, 5:33am

Nevermind…was able to get this working by updating some security group rules as I found a slightly more flexible option with the ingress rules (so can close this). As I suspected it was an issue with the inbound rules.

Topic		Replies	Views
Head node fails to ssh into worker nodes Ray Clusters	6	1936	August 17, 2022
Ray cluster one worker plasma is N/A Ray Clusters	9	443	June 14, 2022
Worker nodes stuck in "waiting-for-ssh" Ray Clusters	8	1718	July 6, 2022
Worker node fails to launch AWS Ray Serve	2	27	May 9, 2025
Ray head stuck on ssh when implementing Cloudwatch Ray Clusters	1	15	March 28, 2025

(launch failed) Worker nodes on AWS get to setup and stall, won't shutdown via ray

Related topics