How severely does this issue affect your experience of using Ray?
- Medium: It contributes significant difficulty to completing my task, but I can work around it.
We are making great use of the Ray core libraries at the moment, specifically long-lived actors that hold state so we can avoid the overhead of repeatedly loading large amounts of data from external services. The main issue we are running up against is the overhead we encounter when creating actors for the first time: pickling/serializing the actor class and sending it from the client to the Ray cluster can take a long time, sometimes 20-120 seconds. I have a few questions about this:
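For context, here is a minimal sketch of the pattern we are using; `DataCache` and its contents are hypothetical stand-ins for our real actors:

```python
import ray

ray.init(address="auto")  # connect to an existing Ray cluster

@ray.remote
class DataCache:
    """Long-lived actor holding state so large data is loaded only once."""

    def __init__(self):
        # Hypothetical loader: in our case this pulls large datasets from
        # external services, which we only want to pay for once per actor.
        self._data = {"example_key": "example_value"}

    def get(self, key):
        return self._data.get(key)

# Creating the actor for the first time is where we see the
# 20-120 second pickling/transfer overhead described above.
cache = DataCache.remote()
print(ray.get(cache.get.remote("example_key")))
```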
- Let’s say that I have an actor class embedded within a large, complex codebase with complex dependencies. When creating an instance of this actor on the remote cluster, what precisely is serialized and sent along with that class? If the actor sits within a complex web of dependencies (third-party and first-party), are all of those serialized and sent over as well? Or is the actor class the only thing serialized, with the assumption that any other dependencies are already installed on the workers in the Ray cluster?
- Is there a way for me to place all relevant actor code on the Ray cluster (on the workers) ahead of time, and, when spinning up an actor, have Ray simply tell the cluster to use the source code that already exists there as the actor template, instead of pickling the code on the client and sending it over? (A sketch of what I have in mind follows this list.)
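To make that second question concrete, here is a sketch of the two approaches I understand might already get part of the way there, based on my reading of the docs. The module path, package name (`my_project`), and cluster address are hypothetical, and I may well be wrong about the pickle-by-reference behavior, so please correct me:

```python
import ray

# Approach A: ship the project source to the cluster once per job via
# runtime_env, so workers can import the code locally instead of receiving
# a fully pickled-by-value class. ("py_modules" is a real runtime_env
# field; the path here is hypothetical.)
ray.init(
    address="ray://head-node.example.com:10001",
    runtime_env={"py_modules": ["./my_project"]},
)

# Approach B (alternative): bake my_project into the cluster's container
# image, or pip-install it on every node, so no upload is needed at all.

# Either way, my understanding is that cloudpickle serializes a class
# defined in an importable module by *reference* (its module path), and
# only serializes by value when the class is defined in __main__ or an
# interactive session. If that is right, importing the actor from a module
# (assumed to be decorated with @ray.remote inside my_project.actors)
# should keep the payload sent to the cluster small:
from my_project.actors import StatefulActor  # hypothetical module

actor = StatefulActor.remote()
```

If pickling by reference plus `runtime_env` already covers this use case, pointers to the relevant docs would be much appreciated.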
Pickling client code to be sent and executed on the remote cluster is a cool technique that makes toy usage and experimentation with Ray very simple, but it seems to get in the way of serious production-grade usage with complex codebases and complex dependency graphs. It would be great if we could avoid this model altogether and simply rely on the source code already existing on the remote cluster.