How to configure a Ray cluster to have actor/task source code and avoid pickling overhead?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

We are making great use of the Ray core libraries at the moment, specifically long-lived actors that hold state and avoid the overhead of repeatedly loading large amounts of data from external services. One of the main issues we're running up against is the overhead we encounter when creating actors for the first time: pickling/serializing the actor class and sending it from the client to the Ray cluster can take a long time, sometimes 20–120 seconds. I have a few questions about this:

  1. Let’s say I have an actor class that’s embedded in a large, complex codebase with complex dependencies. When creating an instance of this actor on the remote cluster, what precisely is serialized and sent to the cluster along with that class? If the actor sits in a complex web of dependencies (3rd-party and 1st-party), are all of those serialized and sent over as well? Or is the actor class the only thing serialized, with the assumption that any other dependencies are already installed on the cluster's workers?
  2. Is there a way for me to place all relevant actor code on the Ray cluster (on the workers) and, when spinning up an actor, have Ray tell the cluster to use the already-existing source code as the actor template instead of pickling the code on the client and sending it over?

Pickling client code to be sent and executed on the remote cluster is a cool technique that makes toy usage and experimentation with Ray very simple, but it seems to get in the way of serious production-grade usage with complex codebases and dependency graphs. It would be great if we could avoid this model altogether and simply rely on the source code already existing on the remote cluster.

This feature isn't supported right now. One reason is that Python is dynamic, so we can't be sure whether the actor's code has changed since it was loaded.
Instead, we hash the serialized function/class and push it to a global store; it then gets cached and stored locally on the workers. There is a code path to load functions from local files, but I don't think it's robust (no one is using it).

I'm surprised it takes 20–120 seconds to complete. Is your function really that big? You shouldn't embed large data in the function itself; it's better to put the data into the object store and pass an object ref around. Could you tell me more about your use case?
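To illustrate the suggestion above: the point of the object store is that only a small handle travels with each task or actor call, not the data itself. This sketch uses plain `pickle` and a made-up token as a stand-in for an `ObjectRef`, just to show the payload-size difference; in real Ray code the handle would come from `ray.put(big_data)`.

```python
import pickle
import uuid

# The data we don't want embedded in every serialized task/actor payload.
big_data = list(range(100_000))

# Anti-pattern: the data itself is part of the payload shipped to workers.
inline_payload = pickle.dumps(big_data)

# Recommended pattern: store the data once (with Ray, via ray.put(big_data))
# and pass only a small reference around. The uuid here is just a stand-in
# for an ObjectRef, not Ray's actual internals.
object_ref = uuid.uuid4().bytes
ref_payload = pickle.dumps(object_ref)

print(len(inline_payload), len(ref_payload))  # the ref is orders of magnitude smaller
```

With Ray itself, the worker would call `ray.get(ref)` to fetch the data from the object store on first use, so repeated tasks on the same node hit the local shared-memory copy.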

@ylc thanks for responding! The long wait times aren't from passing large amounts of data around; they occur when our actors have complex dependencies, like 1st-party imports at the top of the file or complex type annotations on the actor's methods. We've noticed that a complex type annotation on the actor (a 1st-party class that depends on many other 1st-party modules) can make the first invocation of that method, or the instantiation of the actor itself, take a very long time, dozens of seconds. Changing the type annotation to Any, or passing simple dictionaries instead of 1st-party types, reduces the overhead drastically.

Naively, this seems to us to have something to do with the pickling/serialization strategy used when transferring the actor to the cluster. What does pickle (or cloudpickle, dill, etc.) do when the class being pickled depends on other modules? Are those modules pickled as well?
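For what it's worth, plain `pickle` serializes a top-level class *by reference*: the payload records only the module path and qualified name, and the receiving process imports the class itself, so dependencies are never embedded. My understanding is that cloudpickle (which Ray uses) falls back to serializing *by value* for classes it can't import by name, and in that mode it can pull in the objects the class references, which would match the blow-up described above. A minimal sketch of the by-reference case (the `HeavyActor` name is made up):

```python
import pickle

class HeavyActor:
    """Stand-in for an actor class with many 1st-party dependencies."""
    def process(self, batch):
        return len(batch)

# Plain pickle stores only the module path and qualified name for a
# top-level class; none of the class's code or imports is embedded.
payload = pickle.dumps(HeavyActor)
print(len(payload))  # a few dozen bytes, regardless of the class's dependencies
```

The payload contains little more than the string `HeavyActor` plus its module name, which is why by-reference pickling stays cheap no matter how heavy the class's dependency graph is.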

I see. I think this means we should probably optimize this logic somehow. Would it be possible for you to share a minimal reproducible script? I'd like to take a look at what's happening there.