How to Handle Fault Tolerance in Long-Running Ray Jobs?

TG_Link_Hub · November 14, 2024, 8:54pm

I have long-running distributed jobs in Ray, and I’m concerned about fault tolerance and recovery. How does Ray handle failures, and are there any recommendations for building resilient systems when working with Ray in production environments?
