Autoscaler does not seem to watch head node

I have installed a Ray cluster on Kubernetes using `ray up`.
In one of my experiments, the head node failed due to an out-of-memory error. It looks like nothing is watching this node, because it stays in that failed state forever. Should the autoscaler watch the head node as well?
Also, `ray down` fails to clean up any pod that is not in a ready state; these pods have to be deleted manually.
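For now I clean them up by hand, roughly like this, using the `kubernetes` Python client (the `ray` namespace and the `ray-node-type` label selector are assumptions about my setup, not something `ray up` guarantees):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Assumed namespace and label; adjust to match your cluster YAML.
NAMESPACE = "ray"
SELECTOR = "ray-node-type"  # selects pods carrying this label key

# Delete every leftover Ray pod, regardless of its phase
# (Pending, Running, Failed, ...), since `ray down` skipped them.
for pod in v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items:
    print(f"deleting {pod.metadata.name} (phase={pod.status.phase})")
    v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)
```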

Are those bugs?

In addition, `ray down` does not remove the service associated with the head pod, and `ray up` sometimes creates this service but sometimes does not.

@Dmitri, could you clarify the behavior here, please?

Currently, `ray down` does not delete the service, and `ray up` creates the service if it's not already present.
There's an open issue to change this behavior so that `ray down` deletes the service.
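Until that lands, you can remove the leftover service yourself. A minimal sketch with the `kubernetes` Python client; the service name `ray-head` and the `ray` namespace are assumptions, so check the names in your cluster config:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Assumed service name and namespace; the real ones come from
# your autoscaler YAML.
v1.delete_namespaced_service(name="ray-head", namespace="ray")
```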


`ray down` should remove all pods created by `ray up`, regardless of their status. If this didn't work for you, it would be great if you could file an issue on the Ray GitHub with reproduction details!
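If it helps with the report, here is a quick sketch to capture what was left behind after `ray down` (assuming the cluster's pods live in a `ray` namespace):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# List everything still present after `ray down`, with its phase,
# so it can be pasted into the issue. The "ray" namespace is an assumption.
for pod in v1.list_namespaced_pod("ray").items:
    print(pod.metadata.name, pod.status.phase)
```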

Unfortunately, we currently don't implement error handling to deal with Ray head failure. In fact, when launching clusters on Kubernetes with the cluster launcher (`ray up`), the autoscaler runs on the head node, so there's no way for the autoscaler to recover the head node.
In the future, the Ray K8s Operator will implement sensible logic to deal with head failure; this issue is tracked here.

The Ray operator runs the autoscaler in a pod separate from the Ray cluster.

Here’s the documentation on the cluster launcher and operator.


Did you consider running the head node as a Deployment with `replicas: 1`, so that the Deployment restarts it in the case of failures?
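Something along these lines, sketched with the `kubernetes` Python client; the names, image, and command here are placeholders for illustration, not Ray's actual head pod spec:

```python
from kubernetes import client, config

config.load_kube_config()

# A Deployment with replicas=1 wrapping the head pod, so Kubernetes
# restarts the head if it crashes (e.g., gets OOM-killed).
head = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="ray-head"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"component": "ray-head"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"component": "ray-head"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="ray-head",
                        image="rayproject/ray:latest",  # placeholder image
                        command=["ray", "start", "--head", "--block"],
                    )
                ]
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="ray", body=head)
```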
