I noticed that tasks with fractional GPU requirements are scheduled such that they first fully use one GPU before moving on to the next one. This means that some GPUs are 100% loaded while some are staying idle.
For example, if I have 4 GPUs and run 4 tasks with num_gpus=0.5, two of the GPUs are fully loaded with two tasks each, and two GPUs remain idle. This is not the desired behaviour for me: having 1 task per GPU is faster than having 2 tasks per GPU, so leaving GPUs idle makes no sense.
Is there some option I could set to get my desired behaviour of spreading work across GPUs more evenly?
How severely does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty in completing my task, but I can work around it.
Thanks, I’ll try the SPREAD strategy and report back.
The reason I'm not setting num_gpus=1 is this: I want to train 12 networks on 4 GPUs (I'm in a PBT-like scenario). With num_gpus=1, they'd be trained in three batches: 4, 4, 4. By setting num_gpus=0.5, I get two batches: 8, 4. This is faster in terms of wall-clock time, because training two networks on a single GPU is not twice as slow as training only one network, though it is somewhat slower. So after the first 8 networks finish, I want the remaining 4 to be trained as fast as possible, and that means having them on separate GPUs.
(To be clear: there are no actual "batches". I schedule all networks to be trained simultaneously so that there are no synchronization barriers, but since the networks take approximately the same time to train, it's easier for me to think in terms of "batches".)
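For concreteness, here is a rough sketch of how I submit everything at once (train_one_network is just a placeholder for my real training function, not my actual code):

import ray

@ray.remote(num_gpus=0.5)
def train_one_network(config):
    # Placeholder for training a single network on the assigned (half) GPU.
    return config

ray.init()
# All 12 networks are submitted at once: with num_gpus=0.5 on 4 GPUs,
# 8 start immediately and the remaining 4 start as half-GPU slots free up.
futures = [train_one_network.remote(i) for i in range(12)]
results = ray.get(futures)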
Unfortunately, the SPREAD strategy didn't change anything. Here's a minimal example: I run exactly this code on a machine with 4 GPUs, and the tasks are scheduled on only the first two GPUs (two tasks on each). If I increase the number of tasks, the rest of the GPUs are used, so the GPUs are clearly available to Ray.
Could you advise me on what I should try next?
import time
import torch
import ray
@ray.remote(num_gpus=0.5, scheduling_strategy="SPREAD")
def fun():
    # Touch the GPU so the allocation is visible, then hold it for a while.
    torch.zeros((10, 10)).cuda()
    time.sleep(5)

ray.init()
futures = [fun.remote() for _ in range(4)]
print([ray.get(f) for f in futures])
@AwesomeLemon
Unfortunately, this is a known quirk of how scheduling behaves when fractional GPUs are involved. For now, my suggestion is to create a placement group with one 1-GPU bundle per GPU and schedule each task against a specific bundle index.
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

pg = placement_group([{"GPU": 1}] * 4, strategy="SPREAD")

results = [
    fun.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
            # This is the index into the bundle list above.
            # It defaults to -1, which means any available bundle.
            placement_group_bundle_index=0  # Index of the GPU bundle is 0.
        )
    ).remote() for _ in range(4)
]
Unfortunately, the solution suggested by @Chen_Shen didn't work for me; the processes are still not spread across GPUs. I tried tinkering with the solution to no avail: I set placement_group_bundle_index=i (instead of 0), and I had to specify the number of CPUs in each bundle (the tasks cannot be scheduled otherwise). Am I missing something?
pg = placement_group([{'CPU': 4, 'GPU': 1}] * 4, strategy="SPREAD")
ray.get(pg.ready())
futures = [fun.options(scheduling_strategy=PlacementGroupSchedulingStrategy(
               placement_group=pg, placement_group_bundle_index=i)).remote()
           for i in range(4)]
print([ray.get(f) for f in futures])
Your observations are correct: you need to use placement_group_bundle_index=i and set the CPU count for each bundle. Each bundle maps to one GPU, so i is essentially the GPU index.
Good to hear that I made the right changes, but as I mentioned in the previous post, it still didn't work… Do you maybe have an idea why?
(If it helps, my Ray version is 2.0.0.)
You are right. Apparently I also misunderstood how placement groups are implemented; they cannot achieve your goal. I'll check with the team to see if it's a bug.
At this point, the only way I can think of is doing the spread yourself (assuming you are on a single node):
import os

@ray.remote(num_gpus=0.5)
def gpu_task(index):
    # Override the CUDA_VISIBLE_DEVICES value that Ray set for this task,
    # so the work lands on the GPU we choose (round-robin over 4 GPUs).
    os.environ['CUDA_VISIBLE_DEVICES'] = str(index % 4)
    # actual gpu work
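Then spread the work yourself by passing the index, e.g. (just a sketch using the gpu_task above):

futures = [gpu_task.remote(i) for i in range(4)]
ray.get(futures)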
Is your cluster static? As a workaround for now, you can probably do a two-level spread: use NodeAffinitySchedulingStrategy to spread tasks across nodes, and manually override CUDA_VISIBLE_DEVICES to spread across GPUs within a single node.
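Something along these lines, as a rough sketch (the per-node GPU count and the gpu_task body are assumptions, not tested code):

import os
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init()

# Level 1: collect the alive GPU nodes to spread across.
gpu_nodes = [n["NodeID"] for n in ray.nodes()
             if n["Alive"] and n["Resources"].get("GPU", 0) > 0]

GPUS_PER_NODE = 4  # assumption about your machines

@ray.remote(num_gpus=0.5)
def gpu_task(gpu_index):
    # Level 2: pick the GPU within the node ourselves instead of
    # relying on Ray's packing of fractional GPUs.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index % GPUS_PER_NODE)
    # actual gpu work

futures = []
for i in range(12):
    node_id = gpu_nodes[i % len(gpu_nodes)]  # round-robin over nodes
    futures.append(
        gpu_task.options(
            scheduling_strategy=NodeAffinitySchedulingStrategy(
                node_id=node_id, soft=False
            )
        ).remote(i // len(gpu_nodes))  # round-robin over GPUs within a node
    )
ray.get(futures)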