Cluster and node health check mechanism?

Is there a mechanism to determine whether a Ray cluster is healthy? I know that when a driver program crashes I often need to restart the cluster or I get unintended consequences. I would like some way to know whether a Ray cluster is in a good state or not. Ideas? What do other people do?

I was thinking of monitoring all of the Ray processes on each node, but I didn’t want to go that route without checking whether there are better ways.
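For context, something like this per-node process check is what I had in mind. The daemon names are my assumption about what Ray runs on each node, and it needs the third-party `psutil` package for the live check; the matching logic is pulled out so it can be exercised without Ray installed:

```python
# Sketch of a per-node "are the Ray daemons up" check.
# EXPECTED names are assumptions; gcs_server runs only on the head node.
EXPECTED = {"raylet", "gcs_server"}

def running_ray_daemons(proc_names=None):
    """Return the subset of expected Ray daemon names currently running."""
    if proc_names is None:
        import psutil  # third-party; only needed for the live check
        proc_names = (p.info["name"] for p in psutil.process_iter(["name"]))
    return {name for name in proc_names if name in EXPECTED}

def node_ok(expected, proc_names=None):
    """True if every expected daemon is present on this node."""
    return expected <= running_ray_daemons(proc_names)
```

An Ansible task (or cron job) could run `node_ok` on each host and flag any node where a daemon has gone missing.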

thanks,
Luke

That sounds reasonable for getting dead-or-alive information. If you’re interested in slightly more comprehensive health information (or want something more robust against raylet crashes/restarts), you could consider using the Prometheus metrics. See Ray Monitoring — Ray v2.0.0.dev0.
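To make that concrete, a checker could scrape a node's metrics endpoint and read off a gauge. This is only a sketch: the port and the metric name below are placeholders, so check the Ray Monitoring docs for what your Ray version actually exports and where.

```python
# Sketch: scrape a Prometheus text-format endpoint and pull out one metric.
# URL and metric name are assumptions -- consult the Ray Monitoring docs.
from urllib.request import urlopen

def parse_metric(text, name):
    """Return the first sample value for `name` from Prometheus text format.

    Note: matches by prefix, so a name that is a prefix of another metric
    would match that metric's samples too.
    """
    for line in text.splitlines():
        if line.startswith(name):  # comment lines start with '#', so they skip
            # the value is the last whitespace-separated token on the line
            return float(line.rsplit(None, 1)[-1])
    return None

def scrape(url):
    """Fetch the raw metrics page, e.g. http://<node>:<port>/metrics."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode()
```

From there it is a small step to alert (or trigger a restart) when a metric crosses a threshold or a node's endpoint stops responding.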

@virtualluke what would be your ideal interface for monitoring cluster health?

Maybe something like kubectl but showing all ray services across all nodes?

Something like kubectl would be nice.

My immediate need is to know whether I need to recreate/restart a Ray cluster because it is in a bad state. Being able to detect that programmatically, and auto-restart the cluster to get back to a healthy state before submitting a job, would be great. Right now, on a few of our Ray clusters outside of k8s, we manage start and stop with Ansible (basically an Ansible job that runs `ray start` on all the nodes, with `--head` on the head node). I would like a way to know programmatically whether to do a stop/start to get a fresh, clean, happy Ray cluster.
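I was imagining a gate roughly like the following. The probe command and the restart hook are placeholders (not anything Ray ships); the probe runs in a subprocess with a timeout so a hung GCS can't wedge the checker itself:

```python
# Sketch of a "probe, then recycle via Ansible" gate. Placeholders throughout.
import subprocess

# Probe: can a fresh client attach to the running cluster at all?
PROBE = ["python", "-c", "import ray; ray.init(address='auto')"]

def probe_ok(cmd=PROBE, timeout=30, run=subprocess.run):
    """True if the probe process exits 0 within the timeout."""
    try:
        return run(cmd, timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        return False

def ensure_healthy(probe, restart):
    """Restart once if the probe fails, then re-probe."""
    if probe():
        return True
    restart()  # e.g. subprocess.run(["ansible-playbook", "restart_ray.yml"])
    return probe()
```

Running `ensure_healthy(probe_ok, <your ansible hook>)` right before job submission would give the "stop/start only when needed" behavior.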

thanks,
Luke
