How Do I Handle Fault Tolerance in Long-Running Ray Jobs?

I have long-running distributed jobs in Ray, and I’m concerned about fault tolerance and recovery. How does Ray handle failures, and are there any recommendations for building resilient systems when working with Ray in production environments?