Is there a mechanism to determine the health of a Ray cluster? I know that when a driver program crashes, I often need to restart the cluster; otherwise I get unintended consequences. I would like some way to know whether a Ray cluster is in a good state or not. Ideas? What do other people do?
I was thinking of monitoring all of the Ray processes on each node, but I didn’t want to go that route without first checking whether there is a better way.
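
To make the question concrete, here is a minimal sketch of a check that is lighter than watching processes. It assumes that `ray.nodes()` returns one dict per node with an `Alive` flag and a `NodeManagerAddress`. The helper names `summarize_nodes` and `check_cluster` are made up for illustration, and "healthy" here only means "at least one node, none reported dead", which is my own assumption and not an official definition. It would not catch stale state left behind by a crashed driver.

```python
from typing import Any, Dict, List


def summarize_nodes(nodes: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize liveness from a list shaped like the output of ray.nodes()."""
    # Collect addresses of nodes that are not reported alive.
    dead = [
        n.get("NodeManagerAddress", "unknown")
        for n in nodes
        if not n.get("Alive", False)
    ]
    alive_count = len(nodes) - len(dead)
    return {
        "alive": alive_count,
        "dead": dead,
        # Assumption: healthy means at least one node and no dead ones.
        "healthy": alive_count > 0 and not dead,
    }


def check_cluster(address: str = "auto") -> Dict[str, Any]:
    """Hypothetical helper: connect to a running cluster and summarize it."""
    import ray  # imported lazily so summarize_nodes works without Ray installed

    ray.init(address=address, ignore_reinit_error=True)
    try:
        return summarize_nodes(ray.nodes())
    finally:
        ray.shutdown()
```

Calling `check_cluster()` on the head node would return something like a dict with `alive`, `dead`, and `healthy` keys. One caveat I am aware of: nodes that were removed earlier can still show up as not alive, so a non-empty `dead` list may not always mean something is wrong.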