1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.
2. Environment:
Ray version: 2.40.0
Python version: 3.10.14
OS: Ubuntu 22.04.3 LTS
Cloud/Infrastructure:
Other libs/tools (if relevant):
3. What happened vs. what you expected:
I have a multi-step data processing pipeline written with Ray Data. Two of the steps (let's call them A and B) involve GPUs. Of the two, step A uses a small model, so it requires less GPU RAM and finishes earlier. Both steps are processed with map_batches, and I specify the number of GPUs to use.
Let's say I have 8 GPUs on a machine. Ideally I want to keep all of them busy throughout the entire run: assign step A 0.1 GPU with a concurrency of 8, and assign step B 0.9 GPU with a concurrency of 8. This way, all 8 GPUs are always working.
Expected: the 8 copies of each step's model are spread one per GPU. In other words, each GPU is shared by one copy of the step A model and one copy of the step B model (0.1 + 0.9 = 1).
Actual: all 8 copies of the step A model are assigned to GPU 0.
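For concreteness, here is roughly what the two map_batches calls look like (StepA/StepB are placeholders for the actual model wrappers; the dataset and batch sizes are made up):

```python
import ray

class StepA:  # placeholder for the small-model stage
    def __call__(self, batch):
        # small, fast model inference would happen here
        return batch

class StepB:  # placeholder for the large-model stage
    def __call__(self, batch):
        # large, slow model inference would happen here
        return batch

ds = ray.data.range(1_000)  # stand-in for the real input dataset

# Intended layout: 8 StepA actors at 0.1 GPU each plus 8 StepB actors at
# 0.9 GPU each, i.e. one A actor and one B actor co-located on every GPU
# (0.1 + 0.9 = 1.0 per GPU).
ds = ds.map_batches(StepA, num_gpus=0.1, concurrency=8, batch_size=64)
ds = ds.map_batches(StepB, num_gpus=0.9, concurrency=8, batch_size=16)
ds.materialize()

# Observed: all 8 StepA actors end up on GPU 0 (8 x 0.1 = 0.8 of that GPU)
# instead of being spread one per GPU.
```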
I think it's possible to achieve the same effect by merging the two steps into a single step (actor) and assigning it a whole GPU.
But is it possible to use a balanced assignment instead of greedily assigning models/actors to each GPU one by one?
In my use case, step A runs significantly faster than step B, so it finishes much earlier. With my current implementation, in which each step is its own actor class, assigning 1 GPU to step A and the other 7 to step B leaves GPU 0 completely idle after step A finishes. But an instance of the step A actor only requires about 10% of a GPU's RAM, so with balanced assignment I could give each GPU 10% of its RAM for step A and 90% for step B. That way, all GPUs would stay busy the whole time.
A potential issue with merging them into one actor class is that there is an intermediate, CPU-only and CPU-intensive processing step between A and B, and I suspect it would be hard to parallelize that step inside a merged actor class.
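For reference, this is the kind of merged actor I mean (purely illustrative; the method bodies are placeholders):

```python
class MergedStep:  # hypothetical merged actor that owns a whole GPU
    def _run_step_a(self, batch):
        return batch  # small-model inference (GPU)

    def _cpu_intensive_middle(self, batch):
        return batch  # CPU-only, CPU-heavy step; runs serially inside the actor

    def _run_step_b(self, batch):
        return batch  # large-model inference (GPU)

    def __call__(self, batch):
        batch = self._run_step_a(batch)
        batch = self._cpu_intensive_middle(batch)
        return self._run_step_b(batch)

# ds = ds.map_batches(MergedStep, num_gpus=1, concurrency=8, batch_size=16)
```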
Another potential benefit of a balanced strategy is that it facilitates a modular implementation of the processing steps, which makes debugging easier and allows steps to be composed flexibly as needed.
Yeah, makes sense. Another thing you could try is allocating 1 GPU to the smaller model and increasing its batch size, and setting 7 GPUs for the later stage. In theory the throughputs should balance out, but it depends on your application.
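Something along these lines (exact batch sizes are just placeholders you'd tune):

```python
# 1 GPU for the small model, with a larger batch size to boost its throughput
ds = ds.map_batches(StepA, num_gpus=1, concurrency=1, batch_size=512)

# 7 GPUs for the heavier later stage
ds = ds.map_batches(StepB, num_gpus=1, concurrency=7, batch_size=16)
```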
Let me try again to clarify.
As I mentioned previously, step A uses a tiny, fast model, so it finishes much earlier than step B, which uses a giant, slow model. For instance, step A finishes processing all the data in 1 hour, but step B needs another 9 hours to finish.
In this case, if I take this strategy, GPU #0 will only be running for the first hour (doing step A) and will idle for the remaining 9 hours. So effectively the workload of step B is handled by 7 GPUs, while one extra GPU is available but unused for those 9 hours.
This is what I meant by “the speed of doing step B is approximately 7/8”.
Anyway, my current mitigation is to:
- assign step A num_gpus=0.01 and concurrency=1
- assign step B num_gpus=0.95 and concurrency=8
This way, all 8 GPUs share the workload of the compute-intensive step B. It's just not elegant or robust, I presume.
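In code, the workaround looks roughly like this (batch sizes are placeholders):

```python
# One StepA actor with a tiny GPU request; it lands on GPU 0.
ds = ds.map_batches(StepA, num_gpus=0.01, concurrency=1, batch_size=64)

# Eight StepB actors at 0.95 GPU each: since 2 x 0.95 > 1, at most one fits
# per GPU, so one is placed on each of the 8 GPUs (GPU 0 holds 0.01 + 0.95).
ds = ds.map_batches(StepB, num_gpus=0.95, concurrency=8, batch_size=16)
```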