Deprecation of ray.util.ActorPool

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Hello,

we are currently using ray.util.ActorPool as a flexible way to build a queue that streams data into our deep-learning pipeline. A few months ago we initially tried to do the same thing with Ray's Datasets, but were unable to get a "smooth" experience; instead we ran into resource-utilization peaks and idleness.
The ActorPool is a strong and simple utility that enabled us to build exactly what we needed without much hassle: full and direct control over the elements we put into the queue, and no need to materialize the data into some predefined format (currently we load the data from an HTTPS endpoint inside the actor, not before). What I take from the deprecation message is that in the future we will be forced to use Datasets again with ActorPoolStrategy, which, looking at the code, seems somewhat hidden and not easily adaptable to what we are doing right now (it is also not a public API, so I don't think it is intended to be customized).

In our opinion, one of the advantages of Ray is its strong primitives like ActorPool, and we are sad to see it go.
I understand that Ray is moving in the direction of strong and stable machine-learning APIs with AIR (which we are also using and are mostly happy with), but we view the more low-level, customizable utilities as just as valuable.

Is there any chance the ActorPool will get a full replacement, or even stay? I don't see why it is necessary to remove it, except for strategic reasons pushing towards Datasets.

Please let me know if there is a better place to voice such feedback.

Thank you!

Hey @M_S , thank you so much for your feedback!

The currently agreed recommendation as an ActorPool replacement is ray.multiprocessing. While it is definitely not a simple drop-in replacement, we would love to learn more if ray.multiprocessing doesn't work well for you.
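
For reference, ray.util.multiprocessing exposes a Pool that mirrors the stdlib multiprocessing.Pool API (map, imap_unordered, apply_async, …). The sketch below uses the thread-backed stdlib multiprocessing.dummy.Pool so it runs without a Ray cluster; with Ray installed you would swap the import for `from ray.util.multiprocessing import Pool`. The `make` function is an illustrative stand-in, not from the original post.

```python
# Sketch of the Pool API that ray.util.multiprocessing mirrors.
# multiprocessing.dummy.Pool is the thread-backed stdlib variant, used
# here only so the example runs anywhere without a Ray cluster.
from multiprocessing.dummy import Pool

def make(item):
    # Stand-in for per-item work (e.g. fetching from an HTTPS endpoint).
    return item * item

with Pool(processes=4) as pool:
    # imap_unordered streams results back as they complete, roughly
    # analogous to ActorPool.get_next_unordered().
    results = sorted(pool.imap_unordered(make, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The main gap versus ActorPool, as discussed below, is that a Pool distributes stateless functions, not stateful actors.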

In the meantime, the Ray core team would be more than happy to work with you on the Dataset API migration if you are keen.

Please let us know if ray.multiprocessing sounds like an easy alternative, and we will also consider "undeprecating" the ActorPool API if there are significant enough gaps in the alternatives.

This PR undeprecates ActorPool for now: Undeprecate actor pool for now by ericl · Pull Request #28818 · ray-project/ray · GitHub

@M_S, could you provide some more details / pseudocode on the workload that you’re using with ActorPool? We’d like to improve the documentation for actor pool here, and also ensure that multiprocessing / Datasets can also smoothly execute this in the future.


Hi @rickyyx, hi @ericl,

Thank you for undeprecating this for now!
Here is some pseudocode that describes roughly how we are using the actor pool:

class ActorPoolDataset:

  def __init__(self, items, config, random_seed, startup_parallelism, new_actor_fn):
    self._actor_pool = ray.util.ActorPool([new_actor_fn() for _ in range(startup_parallelism)])
    self._startup_parallelism = startup_parallelism
    self._item_stream = _create_weighted_infinite_item_stream(items, config, random_seed)

  def __iter__(self):
    item_it = iter(self._item_stream)
    # Enqueue some items ahead.
    for _ in range(self._startup_parallelism):
      self._actor_pool.submit(lambda actor_, value: actor_.make.remote(value), next(item_it))

    for item in item_it:
      self._actor_pool.submit(lambda actor_, value: actor_.make.remote(value), item)
      ...  # Some extra/optional code to add/remove actors (scaling the pool size).
      # make() returns a batch of elements, hence yield from.
      yield from self._actor_pool.get_next_unordered()

    # Fetch pending elements after all have been enqueued.
    while self._actor_pool.has_next():
      yield from self._actor_pool.get_next_unordered()
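
The pre-enqueue pattern above (keep a window of tasks in flight, then yield one completed result per new submission) can be sketched with the stdlib concurrent.futures, independent of Ray. The names `worker`, `stream_results`, and `window` are illustrative, not from the original code:

```python
import concurrent.futures
import itertools

def worker(item):
    # Stand-in for the actor's make(); any per-item work goes here.
    return item + 1

def stream_results(items, window=4):
    """Keep up to `window` tasks in flight; yield results as they
    finish (unordered), mirroring submit()/get_next_unordered()."""
    it = iter(items)
    with concurrent.futures.ThreadPoolExecutor(max_workers=window) as pool:
        # Enqueue some items ahead (the "pre-enqueue" step).
        pending = {pool.submit(worker, x) for x in itertools.islice(it, window)}
        for item in it:
            done, pending = concurrent.futures.wait(
                pending, return_when=concurrent.futures.FIRST_COMPLETED)
            pending.add(pool.submit(worker, item))
            for f in done:
                yield f.result()
        # Drain pending results after all items have been enqueued.
        for f in concurrent.futures.as_completed(pending):
            yield f.result()

out = sorted(stream_results(range(10)))  # [1, 2, ..., 10]
```

Unlike ActorPool, the thread pool here has no per-worker state; the statefulness constraint below is what makes the Ray version harder to replace.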

Constraints on our side:

  • The item stream is created from a finite set of items, but with custom weighted repeated sampling → the actual stream is infinite.
  • The actors are stateful (the actor method make needs to advance the actor's random state).

Why not ray.datasets:

  • No convenient support for handling the infinite data stream (dataset.repeat and dataset.shuffle are not sufficient because we want to control the sampling weights).
  • When trying to prototype this with Datasets, we found it hard to tune the block size, window size, and parallelism. We ended up with long warm-up times and/or a spiking, uneven CPU load. The actor pool (with pre-enqueueing) gave a much simpler, easier-to-understand interface for controlling resources and the pipeline; it was overall faster, with an even load and moderate startup time.

Why not multiprocessing:

  • The actors are stateful. Each time the remote make is called, the random state needs to be advanced.
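
To illustrate why a stateless worker-function pool doesn't fit, here is a plain-Python sketch of such a stateful actor (with Ray this class would carry an @ray.remote decorator; the class name and RNG body are assumptions):

```python
import random

class MakeActor:
    """Plain-Python sketch of a stateful actor: each make() call
    mutates per-actor state (the RNG), which a stateless
    multiprocessing worker function cannot carry between calls."""

    def __init__(self, seed):
        self._rng = random.Random(seed)

    def make(self, item):
        # Each call advances the actor's private random state.
        return (item, self._rng.random())

actor = MakeActor(seed=0)
a = actor.make("x")
b = actor.make("x")  # same input, different result: state advanced
```

With a Pool, every call would start from the same worker state (or an unspecified one), losing the controlled per-actor random sequence.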

Possible alternatives:

  • Support stateful actors for multiprocessing.
  • Support easy wrapping of infinite iterators in Datasets, and make the tuning of preloading/parallelism more intuitive (the docs/guide lack practical advice on how to do this right; we even did a brute-force grid-search benchmark and could not make sense of why some combinations had high throughput and others did not).
  • Expose the autoscaling ActorPoolStrategy functionality used by Datasets as a public API, so that we can control the actor pool in the background as we are doing in the pseudocode above.

@M_S Thank you so much for the detailed feedback on this - I have saved it and tracked it for future discussion.

What about cases where multiprocessing would work, but initialization is super heavy? Wouldn’t actor pool work better?


Yes, keeping actors around will keep the processes cached. There are also ways to increase parallelism via asyncio/concurrency: AsyncIO / Concurrency for Actors — Ray 2.1.0

Thanks Clarence, but ActorPool is deprecated, so what’s the best practice for cases when you want a Pool but the initialisation is expensive?

@Andrea_Pisoni it is no longer deprecated per Undeprecate actor pool for now by ericl · Pull Request #28818 · ray-project/ray · GitHub
