Ray Actor failover

blshao84 · August 20, 2021, 3:11am

Hi there, I have a use case that multiple workers share data through a single Actor and my ray cluster is running on top of a kubernetes cluster. This single actor is deployed to ray cluster when the cluster is set up. It works fine most of the time, but occasionally I find that single Actor might crash in which case I have to manually redeploy it. Is there any way to configure either ray or kubernetes cluster to auto restart the actor in case of failure?

sangcho · August 20, 2021, 5:31am

Yes! We have a failover API supported; Fault Tolerance — Ray v2.0.0.dev0

blshao84 · August 20, 2021, 5:42am

Thanks! Will give it a try

blshao84 · August 20, 2021, 7:23am

Hi @sangcho, I have a follow up question. In my previous use case, if the actor holds references of data I put into Ray’s distributed object store (using ray.put), in case of failover, is there any way to recover those references (I guess data is still in plasma store)?

blshao84 · August 20, 2021, 8:29am

By the way, from this thread: Get actor handle by ObjectRef it seems ray.objects api returns all shared objects, but I can’t find this api in latest ray. Is it still supported? If not, is there any other way to achieve what ray.objects does?

sangcho · August 20, 2021, 6:19pm

Hi @sangcho, I have a follow up question. In my previous use case, if the actor holds references of data I put into Ray’s distributed object store (using ray.put), in case of failover, is there any way to recover those references (I guess data is still in plasma store)?

By our reference counting protocol, if the owner (the first worker that creates the object) is dead, the data in the plasma store is destroyed. Unfortunately, in this case, you need to implement check point to recover them and put again. For task-based workload, we support fault tolerance for objects now (Fault Tolerance — Ray 3.0.0.dev0).

By the way, from this thread: Get actor handle by ObjectRef it seems ray.objects api returns all shared objects, but I can’t find this api in latest ray. Is it still supported? If not, is there any other way to achieve what ray.objects does?

We put more efforts recently to stabilize APIs, and this API is deprecated. What’s your use case with this API? If you just want to see what objects are in the cluster, there’s a CLI called ray memory that displays all object information in the cluster.

blshao84 · August 21, 2021, 1:58am

Here’s my use case.

I have a client-side app submitting calculation requests first and later triggering the calculation. I use Ray Serve to deploy an actor that receives client requests. This actor has two jobs: 1. accept calculation requests and cache them (using this actor’s own memory); 2. trigger a batch of calculation requests and send them to Ray grid for computation (using Ray’s remote function).

It works but when the number of requests cached goes up, from time to time, I see the RayServe actor crashed due to out of memory.My client-side has some retry mechanism and my whole Ray cluster is running on top of a Kubenetes cluster which will bring up a new actor instance in case of failure. However, those calculation requests are gone when the new instance is up. That’s why I post this question for advice. I can think of several approaches:

Simply increase the actor’s memory. In my case since my cluster is running on top of a Kubenetes cluster, I know we could use different configs for head and worker but not sure whether we could config workers differently.
Use Ray’s Plasma store to cache my calculation requests. But per @sangcho 's comments, it seems not an option now, since those data would be GC-ed when the original actor is dead.
Use local file to persist pricing requests. Performance might be an issue (though I haven’t tested it yet), but also since I’m running on Kubenetes, the new actor instance might run on a different container.
Use Redis. Since Ray cluster has a Redis, I’m thinking of using it but not sure whether it’s an “anti-pattern”.

Any comment would be appreciated.

Topic		Replies	Views
Node fault tolerance in Ray Data Ray Data	2	189	January 10, 2025
How to Cache Objects in the Object Store? Ray Core	1	1626	February 21, 2023
Fault tolerance with Ray actors Ray Core	1	353	June 23, 2021
Plasma Object Ownership, Actors and ObjectLostError Ray Core	3	438	July 29, 2021
How to recover or re-run actor task on a specific worker node after raylet crashed Ray Data	2	938	May 3, 2022

Ray Actor failover

Related topics