Is there a mechanism to determine the status of a Ray cluster's health? I know that when a driver program crashes I often need to restart the cluster or I get unintended consequences. I would like some way to know whether a Ray cluster is in a good state or not. Ideas? What do other people do?
I was thinking of monitoring all of the Ray processes on each node, but I didn't want to go that route without checking whether there are better ways.
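For reference, the per-node check I had in mind is something like the sketch below. It assumes the standard Ray process names (a `raylet` on every node, plus `gcs_server` on the head node), so it only tells you the processes exist, not that they're actually healthy:

```python
# Minimal per-node liveness check. The process names "raylet" and "gcs_server"
# are assumptions based on a standard `ray start`; adjust for your setup.
import psutil

def ray_processes_running(expected=("raylet",)):
    """Return True if every expected Ray process name is found on this node."""
    running = set()
    for p in psutil.process_iter(["name"]):
        running.add(p.info.get("name") or "")
    return all(any(exp in name for name in running) for exp in expected)

if __name__ == "__main__":
    # On the head node you would also expect "gcs_server".
    print("raylet alive:", ray_processes_running(("raylet",)))
```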
That sounds reasonable for getting dead-or-alive information. If you're interested in slightly more comprehensive health information (or want something more robust against raylet crashes/restarts), you could consider using the Prometheus metrics. Ray Monitoring — Ray v2.0.0.dev0
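As a rough illustration of what that could look like: if you already scrape Ray's metrics export port into a Prometheus server, a health probe can just ask Prometheus whether any Ray-exported metric has a recent sample. The Prometheus address and the metric name below are purely illustrative assumptions, not something Ray defines for this purpose:

```python
# Sketch of a freshness probe against a Prometheus server that scrapes Ray's
# metrics endpoint. PROM_URL and the metric name are assumptions for this example.
import requests

PROM_URL = "http://prometheus.internal:9090"  # assumed Prometheus address

def ray_metrics_fresh(metric="ray_node_cpu_utilization", max_age_s=120):
    """Return True if Prometheus has a sample for `metric` newer than max_age_s."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": f"time() - timestamp({metric}) < {max_age_s}"},
        timeout=5,
    )
    resp.raise_for_status()
    # Non-empty result means at least one series had a recent enough sample.
    return len(resp.json()["data"]["result"]) > 0
```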
My immediate need is to know whether I need to recreate/restart a Ray cluster because it is in a bad state. Being able to see that programmatically and auto-restart the cluster, so I have a healthy cluster to submit a job to, would be great. Right now, on a few of our Ray clusters outside of k8s, we manage start and stop with Ansible (basically an Ansible job that runs `ray start` on all the nodes, with `--head` on the head node). I would like a way to know programmatically whether to do a stop/start to get a fresh, clean, happy Ray cluster.
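One way to drive that decision from Ansible is a small readiness probe that connects to the existing cluster and checks how many nodes report as alive, exiting non-zero when the cluster looks unhealthy so the playbook can do the stop/start. This is just a sketch under the assumption that you know how many nodes Ansible started (`EXPECTED_NODES` below is a made-up value), and it won't catch every kind of bad state:

```python
# Readiness probe to run before submitting a job, e.g. from an Ansible task.
# EXPECTED_NODES and the "auto" address are assumptions for illustration.
import sys
import ray

EXPECTED_NODES = 4  # however many nodes your Ansible playbook started

def cluster_is_healthy(address="auto"):
    """Connect to the running cluster and check that enough nodes are alive."""
    try:
        ray.init(address=address, ignore_reinit_error=True)
        alive = [n for n in ray.nodes() if n.get("Alive")]
        return len(alive) >= EXPECTED_NODES
    except Exception:
        # Failing to connect at all also counts as unhealthy.
        return False
    finally:
        ray.shutdown()

if __name__ == "__main__":
    # Non-zero exit tells the caller (e.g. Ansible) to stop/start the cluster.
    sys.exit(0 if cluster_is_healthy() else 1)
```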