Is there a mechanism to determine the status of a Ray cluster's health? I know that when a driver program crashes I often need to restart the cluster or I get unintended consequences. I would like some way to know whether a Ray cluster is in a good state or not. Ideas? What do other people do?
I was thinking of monitoring all of the Ray processes on each node, but I didn't want to go that route without checking whether there are better ways.
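For reference, the per-node check I had in mind is something like the sketch below. It assumes the standard Ray process names (a `raylet` on every node, plus `gcs_server` on the head node), so it only tells you the processes exist, not that they're actually healthy:

```python
# Minimal per-node liveness check. The process names "raylet" and "gcs_server"
# are assumptions based on a standard `ray start`; adjust for your setup.
import psutil

def ray_processes_running(expected=("raylet",)):
    """Return True if every expected Ray process name is found on this node."""
    running = set()
    for p in psutil.process_iter(["name"]):
        running.add(p.info.get("name") or "")
    return all(any(exp in name for name in running) for exp in expected)

if __name__ == "__main__":
    # On the head node you would also expect "gcs_server".
    print("raylet alive:", ray_processes_running(("raylet",)))
```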
That sounds reasonable for getting dead-or-alive information. If you're interested in slightly more comprehensive health information (or want something more robust against raylet crashes/restarts), you could consider using the Prometheus metrics. Ray Monitoring — Ray v2.0.0.dev0
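As a rough illustration of what that could look like: if you already scrape Ray's metrics export port into a Prometheus server, a health probe can just ask Prometheus whether any Ray-exported metric has a recent sample. The Prometheus address and the metric name below are purely illustrative assumptions, not something Ray defines for this purpose:

```python
# Sketch of a freshness probe against a Prometheus server that scrapes Ray's
# metrics endpoint. PROM_URL and the metric name are assumptions for this example.
import requests

PROM_URL = "http://prometheus.internal:9090"  # assumed Prometheus address

def ray_metrics_fresh(metric="ray_node_cpu_utilization", max_age_s=120):
    """Return True if Prometheus has a sample for `metric` newer than max_age_s."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": f"time() - timestamp({metric}) < {max_age_s}"},
        timeout=5,
    )
    resp.raise_for_status()
    # Non-empty result means at least one series had a recent enough sample.
    return len(resp.json()["data"]["result"]) > 0
```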
My immediate need is to know whether I need to recreate/restart a Ray cluster because it is in a bad state. Being able to see that programmatically and auto-restart the cluster, so I have a healthy cluster to submit a job to, would be great. Right now, on a few of our Ray clusters outside of k8s, we manage start and stop with Ansible (basically an Ansible job that runs `ray start` on all the nodes, with `--head` on the head node). I would like a way to know programmatically whether to do a stop/start to get a fresh, clean, happy Ray cluster.
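One way to drive that decision from Ansible is a small readiness probe that connects to the existing cluster and checks how many nodes report as alive, exiting non-zero when the cluster looks unhealthy so the playbook can do the stop/start. This is just a sketch under the assumption that you know how many nodes Ansible started (`EXPECTED_NODES` below is a made-up value), and it won't catch every kind of bad state:

```python
# Readiness probe to run before submitting a job, e.g. from an Ansible task.
# EXPECTED_NODES and the "auto" address are assumptions for illustration.
import sys
import ray

EXPECTED_NODES = 4  # however many nodes your Ansible playbook started

def cluster_is_healthy(address="auto"):
    """Connect to the running cluster and check that enough nodes are alive."""
    try:
        ray.init(address=address, ignore_reinit_error=True)
        alive = [n for n in ray.nodes() if n.get("Alive")]
        return len(alive) >= EXPECTED_NODES
    except Exception:
        # Failing to connect at all also counts as unhealthy.
        return False
    finally:
        ray.shutdown()

if __name__ == "__main__":
    # Non-zero exit tells the caller (e.g. Ansible) to stop/start the cluster.
    sys.exit(0 if cluster_is_healthy() else 1)
```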