Rolling Upgrade of Named Actors

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

Hi,
If I deploy a named actor (e.g. name='title-predictor') via a Ray job submission, I currently kill the existing named actor and then deploy the latest code under the same actor name.

The way I do it currently, though, is that each named actor has its own runtime environment dependencies (its own pip requirements file), so we can have two named actors with conflicting dependencies on the same cluster. I understand that it's recommended to have a single container per cluster in production to capture the environment, but having a separate runtime env per actor is such a wonderful idea that we are trying to see how far we can push it in production.
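For reference, the deployment of each named actor looks roughly like this (a simplified sketch; the class body and the pip pins are placeholders):

import ray

ray.init(address="auto")

@ray.remote
class TitlePredictor:
    # Placeholder for the real model-serving logic.
    def predict(self, text: str) -> str:
        return "some-title"

# Each named actor gets its own pip dependencies via runtime_env, and
# lifetime="detached" keeps it alive after the submitting job finishes.
TitlePredictor.options(
    name="title-predictor",
    lifetime="detached",
    runtime_env={"pip": ["torch==2.1.0", "transformers==4.38.0"]},
).remote()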

However, if I update the code for that actor, re-deploying the named actor may take a few minutes while its requirements are installed on the cluster. So if I kill the existing actor named title-predictor and then redeploy another actor with the same name via the Jobs API, and that redeployment takes a few minutes, there won't be any named actor called title-predictor for that duration, right? So if some client has called title_actor = ray.get_actor('title-predictor') and makes a call on title_actor within those few minutes between the kill of the old actor and the availability of the new one, it'll fail?
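Concretely, I expect something like this to fail during that window, since ray.get_actor raises a ValueError when no actor with that name exists (the predict method is just illustrative):

import ray

try:
    title_actor = ray.get_actor("title-predictor")
    result = ray.get(title_actor.predict.remote("some document text"))
except ValueError:
    # No actor is registered under that name right now: the old one was
    # killed and the new one is still installing its runtime_env.
    result = None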

So, is there a way to do a rolling upgrade of named actors? Or is this a design smell? The way I was thinking of doing this: the code for each actor lives in a different git repo, and when its CI/CD pipeline runs, it redeploys the named actor on an existing Ray cluster. But my challenge is that the actor will be unavailable for those few minutes as mentioned above. If there are other ways of deploying named actors, let me know.

I understand that Ray Serve has rolling upgrades (I think I read that on the Ray blog). That's great, but inside the same Ray cluster I like the Ray actor interface, and I find named actors very intuitive for data scientists (and engineers) to use. A data scientist doesn't have to worry about FastAPI, Serve deployments, serve run, bind, and so on.


Hey, thanks for creating this question, it’s cool to see what people are building on Ray :slight_smile:

Have you considered having a proxy named actor? This actor would defer application logic to some “backend” actor. The proxy named actor can then wait for the new backend named actor to come online before sending work to it.

Hi @cade
Thanks for your quick response.
Just to see if I understood you properly: if I have a title_detector_v1 actor, I should create an additional title_detector_proxy actor, which will be the public proxy? Does this title_detector_proxy have an internal queue, temporarily holding all requests until the new backend named actor is up, and then forward them? Or, instead of a queue, does the proxy just pause for, say, 10 seconds before retrying after the first failure?

If I've misunderstood, please point me to any literature describing this design pattern.

What I did think of is a more generic ActorRegistry (whose internals were a bit complex; I'm simplifying it now with your proxy idea). Every client that wants the title detector actor will do

actor_registry = ray.get_actor("actor-registry")
title_detector_actor = ray.get(actor_registry.get_actor.remote("title-detector"))

get_actor in the ActorRegistry will return the named actor, or wait and retry (with a specified retry policy) if the named actor is not present, e.g. try every 30 seconds for up to 10 minutes and then report failure.
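Roughly what I have in mind for the registry (just a sketch; apart from the retry numbers above, everything is illustrative):

import time
import ray

@ray.remote
class ActorRegistry:
    # Returns named-actor handles, retrying while a redeploy is in flight.
    def get_actor(self, name: str, interval_s: float = 30.0, timeout_s: float = 600.0):
        deadline = time.monotonic() + timeout_s
        while True:
            try:
                return ray.get_actor(name)
            except ValueError:
                if time.monotonic() >= deadline:
                    raise
                time.sleep(interval_s)

The client then unwraps the returned handle with ray.get, as in the snippet above.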

I still have concerns that this may cause a bottleneck for those 2-3 minutes that the actor is not present, but I could be overblowing the issue. Will this be an issue in a high-volume environment?

I understand Ray Serve already has some kind of rolling upgrade, and I see Ray Serve deployments as an external-facing API on top of Ray actors, so I thought I'd ask whether Ray actors can have the same functionality, or whether it's an interesting idea for the Ray community to pursue.

But for now, I'll probably change our architecture to port all actors to Ray Serve deployments (perhaps generating Serve deployments from regular Python classes to make it easy for data scientists... not sure).

Just to see if I understood you properly: if I have a title_detector_v1 actor, I should create an additional title_detector_proxy actor, which will be the public proxy? Does this title_detector_proxy have an internal queue, temporarily holding all requests until the new backend named actor is up, and then forward them? Or, instead of a queue, does the proxy just pause for, say, 10 seconds before retrying after the first failure?

Almost. The proxy actor itself will not contain any application state. It merely keeps track of the actual backend actor that will handle requests and forwards everything it receives there. Then the deployment process looks like this (there's a sketch of the proxy after the steps):

  1. Start title_detector_v2. Wait for it to come fully online (imports, initialization, etc).
  2. Tell the proxy actor that all new traffic should go to title_detector_v2. Proxy actor now stops sending traffic to title_detector_v1, and sends all traffic to title_detector_v2.
  3. Once you have confidence that title_detector_v2 is doing fine (not crashing or anything), then you can kill title_detector_v1.
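A minimal sketch of such a proxy (the method names and constructor are made up for illustration):

import ray

@ray.remote
class TitleDetectorProxy:
    # Holds no application state, only a handle to the current backend actor.
    def __init__(self, backend_name: str):
        self._backend = ray.get_actor(backend_name)

    def set_backend(self, backend_name: str):
        # Called by the deployment script once the new version is ready.
        self._backend = ray.get_actor(backend_name)

    async def predict(self, text: str):
        # Forward every call to whichever backend is currently active.
        return await self._backend.predict.remote(text)

Clients only ever call ray.get_actor("title_detector") to grab the proxy under its stable public name; they never need to know which versioned backend sits behind it.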

This is a pattern known as a blue-green deployment; the article "What Is Blue/Green Deployment?" might be useful if you want to learn more. The benefit of this design is that, if you do it right, there should be zero downtime for your title_detector actor.

I should note that these steps assume that title_detector_v1 has no necessary state to transfer to v2. It’s possible to do this but a bit more involved.
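Put together, the cut-over for steps 1-3 could look roughly like this (assuming the proxy above, plus a backend actor class TitleDetector with a ready() health-check method; all names are illustrative):

import ray

ray.init(address="auto")

# TitleDetector is the @ray.remote backend class wrapping the model (not shown).

# 1. Start the new version and wait until it is fully initialized.
new_backend = TitleDetector.options(
    name="title_detector_v2",
    lifetime="detached",
    runtime_env={"pip": ["transformers==4.38.0"]},
).remote()
ray.get(new_backend.ready.remote())

# 2. Point the proxy at v2; all new traffic now goes to the new version.
proxy = ray.get_actor("title_detector")  # the stable public name
ray.get(proxy.set_backend.remote("title_detector_v2"))

# 3. Once v2 looks healthy, retire v1.
ray.kill(ray.get_actor("title_detector_v1"))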

What I did think of is a more generic ActorRegistry (whose internals were a bit complex; I'm simplifying it now with your proxy idea). Every client that wants the title detector actor will do

 actor_registry = ray.get_actor("actor-registry")
 title_detector_actor = ray.get(actor_registry.get_actor.remote("title-detector"))

get_actor in the ActorRegistry will return the named actor, or wait and retry (with a specified retry policy) if the named actor is not present, e.g. try every 30 seconds for up to 10 minutes and then report failure.

An actor registry idea could work too, although it is probably cleaner to separate out components so each has a single responsibility.

I still have concerns that this may cause a bottleneck for those 2-3 minutes that the actor is not present, but I could be overblowing the issue. Will this be an issue in a high-volume environment?

I don’t know what requirements you have, so I can’t say if this will be an issue :slight_smile: If your requirements are zero downtime, then you’ll need something like blue-green deployments. Otherwise, if your users can handle a short outage during the upgrade, then this complexity isn’t warranted.

I understand Ray Serve already has some kind of rolling upgrade, and I see Ray Serve deployments as an external-facing API on top of Ray actors, so I thought I'd ask whether Ray actors can have the same functionality, or whether it's an interesting idea for the Ray community to pursue.

But for now, I'll probably change our architecture to port all actors to Ray Serve deployments (perhaps generating Serve deployments from regular Python classes to make it easy for data scientists... not sure).

I think this is a great idea. There’s a good number of folks working on making Ray Serve highly available. If you can piggy-back off of that work without doing anything yourself, I’d call that a win.
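For what it's worth, wrapping an existing Python class as a Serve deployment is fairly small (a sketch; the class body and names are placeholders):

from ray import serve

@serve.deployment(num_replicas=2)
class TitleDetector:
    def __init__(self):
        # Load the model here.
        self.model = None

    def __call__(self, text: str) -> str:
        return "some-title"

# Redeploying the same application updates the replicas in place, so data
# scientists keep a stable entry point across upgrades.
handle = serve.run(TitleDetector.bind(), name="title-detector")
response = handle.remote("some document text")
# Resolve the response with .result() (or ray.get on older Ray versions).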

Sorry for the late reply, I just came back from vacation.

Thanks for the explanation of blue-green deployments! I'll definitely keep that design in my arsenal. Currently we just wrap ML models, so we don't have any state, and it should be a good alternative. I'll also investigate Ray Serve. Thanks!
