I have installed a Ray cluster on Kubernetes using ray up.
In one of my experiments the head node failed due to an out-of-memory error. It looks like nothing is watching this node, because it just sits in this state forever. Should the autoscaler watch this node as well?
Also, ray down fails to clean up any pod that is not in the Ready state. These pods have to be deleted manually.
ray down should remove all pods created by ray up, regardless of their status. If this didn’t work for you, it would be great if you could file an issue on the Ray GitHub with bug reproduction details!
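In the meantime, a possible manual workaround is to clean up the leftover pods with kubectl. This is just a sketch: the namespace `ray` is an assumption (use whatever namespace ray up deployed into), and it only targets pods in the `Failed` phase, which is where OOM-killed pods typically end up.

```shell
# Assumed namespace "ray" -- adjust to wherever ray up created the pods.
# List pods stuck in the Failed phase (e.g. after an OOM kill):
kubectl get pods -n ray --field-selector=status.phase=Failed

# Delete those leftover pods that ray down missed:
kubectl delete pods -n ray --field-selector=status.phase=Failed
```

If a pod is stuck terminating, adding `--grace-period=0 --force` to the delete command will remove it immediately.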
Unfortunately, we currently don’t implement error handling for Ray head failure. When launching clusters on Kubernetes with the cluster launcher (ray up), the autoscaler runs on the head node, so there is no way for the autoscaler to recover the head node.
In the future, the Ray K8s Operator will implement sensible logic to deal with head failure; this issue is tracked here.
The Ray operator runs the autoscaler in a pod separate from the Ray cluster.
Here’s the documentation on the cluster launcher and operator.