In our workflows, Ray will sometimes "get stuck" shortly after a
ray.get call, and resource utilization on our clusters drops to essentially zero. Is there any way within Ray to detect when this happens, or to raise an error when it does?
This kind of functionality would let us restart the job, which usually does the trick, or delete the cluster, which would save us money.
We could build our own monitor, but I wanted to check whether something like this already exists.
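For context, one approach I've considered: `ray.get` accepts a `timeout=` argument and raises `ray.exceptions.GetTimeoutError` when it expires, which at least turns a silent hang into an exception. A library-agnostic sketch of the same idea, wrapping any blocking fetch in a watchdog thread (the function name and structure here are just illustrative, not an existing Ray utility):

```python
import concurrent.futures


def get_with_timeout(fetch, timeout_s):
    """Run a potentially hanging blocking call with a deadline.

    `fetch` is a zero-argument callable (e.g. lambda: ray.get(ref)).
    Raises TimeoutError if it doesn't finish within `timeout_s` seconds,
    so the caller can restart the job or tear down the cluster instead
    of sitting idle forever.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"fetch did not finish within {timeout_s}s")
    finally:
        # Don't block waiting for a stuck worker thread to exit.
        pool.shutdown(wait=False)


# Usage: instead of `result = ray.get(ref)`, something like
#   result = get_with_timeout(lambda: ray.get(ref), timeout_s=600)
```

With Ray itself, `ray.get(ref, timeout=600)` gives the same behavior more directly; the wrapper is only useful if you want one hang-detection path across multiple blocking calls.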