Multi-model batch inference - problem with scaling of actors

Hello everyone, we’re building a multi-model batch inference pipeline and so far, we’re enjoying Ray a lot! However, I’ve noticed one problem with autoscaling when using stateful maps (with actors). Our pipeline looks as follows:

read_parquet → Preprocess (load image etc., on CPU) → annotate (on GPU) → preprocess → annotate → … → write_parquet

Say we have 3 models in this pipeline and we want to use autoscaling, so we set concurrency to (1, 16) for every map operator. Ray will then initially assign many GPUs to the first annotator and fewer to the 2nd and 3rd. This uses the resources fully, so the scaling algorithm is satisfied, but the pipeline is unbalanced: the first annotator outputs far more records than the 2nd and 3rd can consume. Backpressure isn’t triggered, because the rows are very small and don’t take much space in the object store, and every worker has a long queue of blocks to process. But if the job crashes, all the progress of the first annotators is lost, because Ray doesn’t have a resume feature.
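To make the imbalance concrete, here’s a toy back-of-the-envelope model (plain Python, made-up worker counts and throughput numbers, nothing Ray-specific): if autoscaling gives the first annotator most of the GPUs, the backlog between stage 1 and stage 2 grows without bound, and that backlog is exactly the work lost on a crash.

```python
def backlog_after(ticks, workers=(14, 1, 1), rows_per_worker_per_tick=10):
    """Rows queued between stage 1 and stage 2 after `ticks` steps.

    workers: number of actors autoscaling assigned to each of the 3 stages.
    Stage 1 produces rows; stage 2 consumes them at the same per-worker rate.
    """
    produced = workers[0] * rows_per_worker_per_tick * ticks
    consumed = workers[1] * rows_per_worker_per_tick * ticks
    return produced - consumed

# 14 workers feeding 1: the unsaved backlog grows by 130 rows every tick.
print(backlog_after(100))  # → 13000
# A balanced split would keep the backlog flat:
print(backlog_after(100, workers=(5, 5, 5)))  # → 0
```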

Is there a way to configure Ray to scale down some workers in this situation, or is such a feature planned?

1. Severity of the issue:
Low: Annoying but doesn’t hinder my work.

2. Environment:

  • Ray version: 2.48.0
  • Python version: 3.10.11
  • OS: Linux
  • Cloud/Infrastructure: 4 nodes x 4xA100s
  • Other libs/tools (if relevant): -

3. What happened vs. what you expected:

  • Expected: Ray scales down actors if a pipeline is unbalanced
  • Actual: Ray never scales down actors as long as they’re busy and there’s space in the object store.

From reading some similar discussions in the Ray GitHub: Ray’s actor pool autoscaling is sensitive to transient usage, and does not consider usage over time intervals or downstream backpressure, which can cause the scaling behavior you’ve seen. There are open discussions about making the autoscaler more robust and responsive to such pipeline imbalances, but as of Ray 2.48.0 this is not available… I think there are some workarounds, but I don’t know them off the top of my head. :frowning:
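For what it’s worth, one workaround along these lines is to give each GPU stage a fixed actor pool size instead of the (1, 16) range, so autoscaling can’t over-allocate the first annotator. A minimal sketch, assuming `map_batches` with callable classes (the classes, paths, and pool sizes below are placeholders, not your actual code):

```python
import ray

# Placeholder stages for illustration only; substitute your real classes.
class Preprocess:
    def __call__(self, batch):
        return batch  # load/decode images here (CPU)

class Annotator:
    def __call__(self, batch):
        return batch  # run the model here (GPU)

ds = ray.data.read_parquet("s3://my-bucket/input/")
ds = ds.map_batches(Preprocess, concurrency=8)
# Fixed pool size instead of concurrency=(1, 16): with 16 GPUs and 3 models,
# capping each annotator at 5 actors keeps the stages balanced, at the cost
# of giving up autoscaling for those operators.
ds = ds.map_batches(Annotator, concurrency=5, num_gpus=1)
ds = ds.map_batches(Preprocess, concurrency=8)
ds = ds.map_batches(Annotator, concurrency=5, num_gpus=1)
ds = ds.map_batches(Annotator, concurrency=5, num_gpus=1)
ds.write_parquet("s3://my-bucket/output/")
```

The trade-off is that you have to size the pools yourself based on the relative throughput of each model.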

It definitely seems like an area for improvement, though.

Discussions I talked about:


Thanks for your answer. My problem is a bit different from the ones in the GitHub threads you posted: in the first, the problem is that it scales up and down too much, and in the second, some resources go unused and for some reason Ray can’t create more actors. My problem is that all resources are fully utilized but Ray can’t scale down, which causes the imbalance, and the work will not be saved if the job crashes.