Ray Actor failover

Hi there, I have a use case where multiple workers share data through a single actor, and my Ray cluster runs on top of a Kubernetes cluster. This actor is deployed when the cluster is set up. It works fine most of the time, but occasionally the actor crashes and I have to redeploy it manually. Is there any way to configure either Ray or Kubernetes to restart the actor automatically on failure?

Yes! We support a failover API for this: Fault Tolerance — Ray v2.0.0.dev0

Thanks! I'll give it a try.

Hi @sangcho, I have a follow-up question. In my previous use case, if the actor holds references to data I put into Ray’s distributed object store (via ray.put), is there any way to recover those references after a failover? (I assume the data is still in the plasma store.)

By the way, from this thread (Get actor handle by ObjectRef) it seems the ray.objects API returns all shared objects, but I can’t find this API in the latest Ray. Is it still supported? If not, is there another way to achieve what ray.objects does?

> Hi @sangcho, I have a follow-up question. In my previous use case, if the actor holds references to data I put into Ray’s distributed object store (via ray.put), is there any way to recover those references after a failover? (I assume the data is still in the plasma store.)

Under our reference-counting protocol, if the owner (the worker that created the object) dies, the data in the plasma store is destroyed. Unfortunately, in this case you need to implement checkpointing to recover the data and put it again. For task-based workloads, we now support fault tolerance for objects (Fault Tolerance — Ray 3.0.0.dev0).

> By the way, from this thread (Get actor handle by ObjectRef) it seems the ray.objects API returns all shared objects, but I can’t find this API in the latest Ray. Is it still supported? If not, is there another way to achieve what ray.objects does?

We have recently put more effort into stabilizing the APIs, and this API is deprecated. What’s your use case for it? If you just want to see which objects are in the cluster, there’s a CLI command, ray memory, that displays information about all objects in the cluster.

Here’s my use case.

I have a client-side app that first submits calculation requests and later triggers the calculation. I use Ray Serve to deploy an actor that receives the client requests. This actor has two jobs: 1. accept calculation requests and cache them in the actor’s own memory; 2. trigger a batch of cached requests and send them to the Ray cluster for computation (via Ray’s remote functions).

It works, but as the number of cached requests grows, I occasionally see the Ray Serve actor crash due to running out of memory. My client side has a retry mechanism, and my whole Ray cluster runs on top of a Kubernetes cluster, which brings up a new actor instance on failure. However, the cached calculation requests are gone by the time the new instance is up. That’s why I’m posting this question for advice. I can think of several approaches:

  1. Simply increase the actor’s memory. Since my cluster runs on top of Kubernetes, I know the head and workers can use different configs, but I’m not sure whether individual workers can be configured differently from one another.
  2. Use Ray’s plasma store to cache my calculation requests. But per @sangcho’s comments, that doesn’t seem to be an option, since the data would be garbage-collected when the original actor dies.
  3. Use a local file to persist the pricing requests. Performance might be an issue (though I haven’t tested it yet), and since I’m running on Kubernetes, the new actor instance might land in a different container.
  4. Use Redis. Since the Ray cluster already runs a Redis, I’m thinking of using it, but I’m not sure whether that’s an “anti-pattern”.

Any comments would be appreciated.