In our workflows, Ray will sometimes "get stuck" shortly after a
ray.get call, and resource utilization on our clusters drops to essentially zero. Is there any way within Ray to detect when this happens, or to raise an error when it does?
This kind of functionality would let us restart the job, which usually does the trick, or delete the cluster, which would save us money.
We could build our own monitor, but I wanted to check whether something like this already exists.
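For context, one approach I've considered: `ray.get` accepts a `timeout=` argument and raises `ray.exceptions.GetTimeoutError` when it expires, which at least turns a silent hang into an exception. A library-agnostic sketch of the same idea, wrapping any blocking fetch in a watchdog thread (the function name and structure here are just illustrative, not an existing Ray utility):

```python
import concurrent.futures


def get_with_timeout(fetch, timeout_s):
    """Run a potentially hanging blocking call with a deadline.

    `fetch` is a zero-argument callable (e.g. lambda: ray.get(ref)).
    Raises TimeoutError if it doesn't finish within `timeout_s` seconds,
    so the caller can restart the job or tear down the cluster instead
    of sitting idle forever.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"fetch did not finish within {timeout_s}s")
    finally:
        # Don't block waiting for a stuck worker thread to exit.
        pool.shutdown(wait=False)


# Usage: instead of `result = ray.get(ref)`, something like
#   result = get_with_timeout(lambda: ray.get(ref), timeout_s=600)
```

With Ray itself, `ray.get(ref, timeout=600)` gives the same behavior more directly; the wrapper is only useful if you want one hang-detection path across multiple blocking calls.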