How to change the GPU assignment strategy from greedy to balanced?

1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.

2. Environment:

  • Ray version: 2.40.0
  • Python version: 3.10.14
  • OS: Ubuntu 22.04.3 LTS
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:
I have a multi-step data processing pipeline written with Ray Data. Two of the steps (let’s call them A and B) involve GPUs. Step A uses a small model, so it requires less GPU RAM and finishes earlier. Both steps are processed with map_batches, and I specify the number of GPUs to use.
Let’s say I have 8 GPUs in a machine. Ideally I want to keep all of them busy throughout the entire run: assign step A 0.1 GPU with a concurrency of 8, and assign step B 0.9 GPU with a concurrency of 8, so that all 8 GPUs are always working (a sketch of this layout is below).

  • Expected: The 8 copies of each step’s model are assigned one GPU each. In other words, each GPU is shared by one copy of the step A model and one copy of the step B model (0.1 + 0.9 = 1).
  • Actual: The 8 copies of the step A model are all assigned to GPU 0.
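For reference, here is a minimal sketch of the layout described above. StepA, StepB, their bodies, and the placeholder dataset are hypothetical stand-ins (the real classes would load the small and large models in __init__); only the num_gpus/concurrency settings matter here.

```python
import ray


class StepA:
    """Stand-in for the small, fast model stage."""

    def __init__(self):
        # Hypothetical: load the small model onto the assigned GPU here.
        pass

    def __call__(self, batch):
        batch["a_out"] = batch["id"] * 2  # stand-in for small-model inference
        return batch


class StepB:
    """Stand-in for the large, slow model stage."""

    def __init__(self):
        # Hypothetical: load the large model onto the assigned GPU here.
        pass

    def __call__(self, batch):
        batch["b_out"] = batch["a_out"] + 1  # stand-in for large-model inference
        return batch


ds = ray.data.range(10_000)  # placeholder dataset

# Intent: 8 A-actors at 0.1 GPU each and 8 B-actors at 0.9 GPU each, so every
# one of the 8 GPUs hosts exactly one copy of each (0.1 + 0.9 = 1.0).
ds = ds.map_batches(StepA, num_gpus=0.1, concurrency=8, batch_size=256)
ds = ds.map_batches(StepB, num_gpus=0.9, concurrency=8, batch_size=64)
ds.materialize()  # assumes a machine with 8 GPUs
```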

I think it’s possible to achieve the same effect by merging the 2 steps into a single step (actor) and assigning it a whole GPU.
But is it possible to use a balanced assignment instead of greedily packing models/actors onto each GPU one by one?
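A rough sketch of what that merged actor could look like (the fused class and its stand-in inference lines are hypothetical):

```python
import ray


class StepAB:
    """Stand-in for a fused stage that runs both models on the same GPU."""

    def __init__(self):
        # Hypothetical: load both the small and the large model here.
        pass

    def __call__(self, batch):
        batch["a_out"] = batch["id"] * 2     # stand-in for step A inference
        batch["b_out"] = batch["a_out"] + 1  # stand-in for step B inference
        return batch


ds = ray.data.range(10_000)  # placeholder dataset
# One whole GPU per fused actor; assumes 8 GPUs are available.
ds = ds.map_batches(StepAB, num_gpus=1, concurrency=8, batch_size=64)
ds.materialize()
```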

Hmm, I think the main way to achieve what you’re doing would be to merge everything into 1 actor.

Why do you want to do balanced assignment?

alright

In my use case, step A runs significantly faster than step B, so it finishes much earlier. With my current implementation, where each step is its own actor class, if I assign 1 GPU to step A and the other 7 to step B, GPU 0 sits completely idle after step A finishes. But an instance of the step A actor only needs about 10% of a GPU's RAM, so with balanced assignment I could give 10% of each GPU's RAM to step A and 90% to step B. That way, all GPUs would be working the whole time.

A potential issue with merging them into one actor class is that there’s an intermediate, CPU-only and CPU-intensive processing step between A and B. I suspect it would be hard to parallelize that step inside a merged actor class (see the sketch below).
Another potential benefit of a balanced strategy is that it keeps the processing steps modular, which makes debugging easier and lets me compose steps as needed.
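One way to keep that middle step modular is to leave it as its own map_batches stage between the two GPU stages. A rough sketch, reusing the hypothetical StepA/StepB stand-ins from the first sketch, with cpu_heavy_transform as a made-up placeholder for the real CPU-bound work:

```python
def cpu_heavy_transform(batch):
    # Stand-in: transform step A's output in place for step B.
    batch["a_out"] = batch["a_out"] * 3
    return batch


ds = ray.data.range(10_000)  # placeholder dataset
ds = ds.map_batches(StepA, num_gpus=0.1, concurrency=8, batch_size=256)
# CPU-only stage: no num_gpus request, so it runs as tasks that scale across
# CPU cores independently of the GPU actor pools.
ds = ds.map_batches(cpu_heavy_transform, num_cpus=1, batch_size=256)
ds = ds.map_batches(StepB, num_gpus=0.9, concurrency=8, batch_size=64)
ds.materialize()
```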

Yeah, makes sense. Another thing you could try is allocating 1 GPU to the smaller model and increasing the batch size, and setting 7 GPUs for the later stage. In theory the throughputs should balance out, but it depends on your application.
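For concreteness, that suggestion would look roughly like this (again with the hypothetical StepA/StepB stand-ins from the first sketch):

```python
ds = ray.data.range(10_000)  # placeholder dataset
# One A-actor with a bigger batch size on one GPU...
ds = ds.map_batches(StepA, num_gpus=1, concurrency=1, batch_size=1024)
# ...and seven B-actors, one per remaining GPU.
ds = ds.map_batches(StepB, num_gpus=1, concurrency=7, batch_size=64)
ds.materialize()
```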

That’s what I mentioned here. The drawback is that the speed of step B is then approximately 7/8 of what it could be, which matters because my step B is really, really slow.

Is it not a streaming operation? Do you want to post some code?

It does run in a streaming way.

Let me try again to clarify.
As I mentioned previously, step A uses a tiny and fast model, so it finishes much earlier than step B, which uses a giant and slow model. For instance, step A finishes processing all the data in 1 hour but step B needs an extra 9 hours to finish.

In this case, if I take that strategy (1 GPU for step A with a bigger batch size, 7 GPUs for step B, as sketched above), GPU #0 will be busy only for the first 1 hour (doing step A) and idle for the remaining 9 hours. So effectively the workload of step B is handled by 7 GPUs, while 1 extra GPU is available but unused for those 9 hours.
This is what I meant by “the speed of doing step B is approximately 7/8”.

Anyway, my current mitigation is to

  1. assign step A num_gpus=0.01 and concurrency=1
  2. assign step B num_gpus=0.95 and concurrency=8

This way, all 8 GPUs share the workload of the compute-intensive step B. It’s just not elegant or robust, I presume.
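In code, the mitigation looks roughly like this (still with the hypothetical StepA/StepB stand-ins from the first sketch): step A requests a token 0.01 GPU so it can be packed onto any GPU, while step B requests 0.95 GPU so no two B-actors fit on the same GPU and one lands on each of the 8.

```python
ds = ray.data.range(10_000)  # placeholder dataset
ds = ds.map_batches(StepA, num_gpus=0.01, concurrency=1, batch_size=1024)
ds = ds.map_batches(StepB, num_gpus=0.95, concurrency=8, batch_size=64)
ds.materialize()
```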