(launch failed) Worker nodes on AWS get to setup and stall, won't shut down via Ray

  • High: It blocks me from completing my task.

I’m pretty sure this is a security-group/ports issue, but I need confirmation/help.

I’m using AWS, and my organization requires restrictive security groups. For any port besides 22 and a couple of others (443, 80, etc.), any ingress/inbound rule I create has to specify the exact IP address of the source.
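For context, this is the kind of rule I’m limited to, sketched with the AWS CLI (the group ID and source IP are placeholders; 6379 is Ray’s default GCS port, used here just as an example of a non-standard port):

```shell
# Ingress on a non-standard port must name an exact /32 source address.
# sg-0123456789abcdef0 and 203.0.113.10 are placeholders.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 6379 \
  --cidr 203.0.113.10/32
```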

I’m trying to use the autoscaler. I start Ray with “ray up …” and up comes a head node. I have a security group on the head node that allows the machine I’m working from to reach any port on the head node.
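For reference, the relevant part of my cluster YAML looks roughly like this (a minimal sketch, not my exact file; instance types, region, and the security group ID are placeholders):

```yaml
# Minimal sketch of the cluster config used with `ray up`.
cluster_name: example
provider:
  type: aws
  region: us-west-2
head_node_type: ray.head.default
available_node_types:
  ray.head.default:
    node_config:
      InstanceType: m5.large
      SecurityGroupIds: [sg-0123456789abcdef0]  # allows my workstation in
  ray.worker.default:
    min_workers: 0
    max_workers: 2
    node_config:
      InstanceType: m5.large
      SecurityGroupIds: [sg-0123456789abcdef0]
```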

When I run an example script that spins up worker nodes, they launch, and connecting via SSH on port 22 (from the head node) all goes fine. Then comes setup, and they get stuck (I’m watching the dashboard). Eventually I get this output…

“Removing 2 nodes of type ray.worker.default (launch failed).”

I never actually see these nodes doing any work on the dashboard (they never show up in the cluster tab, only in the overview tab), and they don’t terminate automatically either, even though Ray says the launch failed.

Do these worker nodes need to be able to reach any port on the Ray head node (meaning I’d need an inbound security rule for each worker), or what else might be going on here?
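One way I could narrow this down is to probe the head node’s ports from a worker. A small sketch, assuming bash on the worker (the head IP is a placeholder; 6379 is Ray’s default GCS port and 8265 the default dashboard port):

```shell
# Probe whether the head node's default Ray ports are reachable from here.
# Uses bash's /dev/tcp so no extra tools are needed. 10.0.0.5 is a
# placeholder for the head node's private IP.
head_ip=10.0.0.5
for port in 6379 8265; do
  if timeout 3 bash -c "</dev/tcp/${head_ip}/${port}" 2>/dev/null; then
    echo "port ${port}: reachable"
  else
    echo "port ${port}: blocked or closed"
  fi
done
```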

Never mind… I was able to get this working by updating some security group rules, as I found a slightly more flexible option with the ingress rules (so this can be closed). As I suspected, it was an issue with the inbound rules.