Worker initializer in ray.util.multiprocessing.Pool

I have an expensive initialization that I want each worker to perform before it starts processing tasks. The resulting object should then be accessible to all tasks on that worker.
With actors this is straightforward: I initialize it in the constructor.

However, I am not clear on how to do it with tasks, and how to do it with the ray.util.multiprocessing.Pool.

Specifically, the initializer in the Pool API does not return a value. The pattern I know from Python's multiprocessing.Pool is to assign to a global variable in the initializer, which is then accessible from the worker task as well. However, this does not seem to work with the Ray Pool. What is the intended usage pattern for initializers in the Ray multiprocessing Pool?

cc @eoakes, do you know how we can achieve this? This seems like a common use case for our multiprocessing pool.

Two things come to mind:

  1. In general, if you want global state, you can wrap it in an actor, then ray.get() it inside your parallelized function. The caveat is that this incurs deserialization overhead (which can be large if your object is a large, non-array-like object).

  2. Use ActorPool, which is built for this exact case. You could even wrap it and call it from your pool map function if you wanted.

I am indeed using the ActorPool now, which works well, although I need to specify in advance how many actors I will have. I was under the impression that the Ray multiprocessing.Pool grows and shrinks automatically with the number of tasks and the current cluster size (i.e., if I supply many tasks, it will create more actors and autoscale up). Or is that just wishful thinking on my part?

Oh, I see. I think the multiprocessing pool defaults to creating one actor per CPU in the cluster; I’m not aware of any fancy tricks there.

You’re right that ActorPool doesn’t have a way of adding actors to an existing pool right now, but it should be pretty easy to add (as long as someone is willing to implement it). Do you mind filing a GitHub feature request?

@Alex, is using the ActorPool still the recommended approach for situations where you want a pool of processes and you have expensive initialisation? I see it’s deprecated now.

As noted in "Deprecation of ray.utils.ActorPool" (#9 by ClarenceNg), the actor pool is no longer deprecated.

@Yoav, regarding your question about expensive initialization: if it is about process warming / code loading, Ray should already handle that, since we do some caching / re-use of workers.

Otherwise, have you considered using the object store, i.e. ray.put once up front and ray.get at the beginning of the task?