Passing multiple GPUs to ray.multiprocessing.Pool

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I’m trying to parallelize the training of a PyTorch model with ray.multiprocessing.Pool on a server with two GPUs. Since other people also use the server, I wrote a function that checks which of the two GPUs has more free memory available and then puts the training on that GPU. Using Pool is convenient for me because it automatically splits the list of configuration files I pass in into like-sized chunks.

However, I can’t get ray to make both GPUs available inside the function. Right now I’m bypassing the problem with os.environ["CUDA_VISIBLE_DEVICES"] = "0, 1" inside the function, but it’s somewhat unsatisfactory.

Below is a short script illustrating the issue:

  • When I’m using a remote function and pass num_gpus=2, everything works as expected and both GPUs are detected from inside the function.
  • When I’m using Pool with ray_remote_args={"num_gpus": 2}, nothing gets executed and the program is stuck with the warning "The following resource request cannot be scheduled right now: {'GPU': 2.0}".

Is there a way to make both GPUs available for Pool without explicitly setting os.environ["CUDA_VISIBLE_DEVICES"] from within my function?

import os
import ray
from ray.util.multiprocessing import Pool

def use_gpu(x):
    # os.environ["CUDA_VISIBLE_DEVICES"] = "0, 1" # Works as expected
    print(f"Resources: {ray.available_resources()}")
    print(f"ray.get_gpu_ids(): {ray.get_gpu_ids()}")
    print(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")

@ray.remote(num_gpus=2)
def use_gpu_remote(x):
    print(f"Resources: {ray.available_resources()}")
    print(f"ray.get_gpu_ids(): {ray.get_gpu_ids()}")
    print(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")

if __name__ == "__main__":
    ray.init(num_cpus=2, num_gpus=2)

    # Using remote function: works
    # iterable=[0, 1, 2, 3, 4, 5]
    # futures = [use_gpu_remote.remote(x) for x in iterable]
    # results = ray.get(futures)

    # Using pool: doesn't work (Warning: The following resource request cannot be scheduled right now: {'GPU': 2.0}.)
    pool = Pool(processes=2, ray_remote_args={"num_gpus": 2})
    pool.map(func=use_gpu, iterable=[0, 1, 2, 3, 4, 5])
    pool.close()
    pool.join()

Thanks @X3N4 for the detailed question and the repro!

I’m not an expert in this matter, so I’ll try to ask around as well. But looking at the source code of ray.util.multiprocessing, it seems to me that ray_remote_args is applied to each process (actor), so I wonder if you could try this instead:

    pool = Pool(processes=2, ray_remote_args={"num_gpus": 1})

Since you have 2 processes and 2 GPUs, shouldn’t each one only have 1? This would also explain the error message, since Ray was trying to schedule 4 GPUs when it only has 2.

Thanks for your reply!

Since you have 2 processes and 2 GPUs, shouldn’t each one only have 1?

Both processes should have both GPUs available. Running your code makes one GPU available to each process, but then I have no control over which one. I want to decide inside the function, based on the current workload, which GPU the task actually runs on. Note that in my actual use case I have more like 16 processes for these two GPUs.

Seems like that’s the problem. But why does it work for remote functions then in my example? Shouldn’t it also try to schedule more GPUs than are available? The argument is the same after all. I also tried fractional GPUs, but then of course only the first GPU is ever used.
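
The fractional attempt looked roughly like this (just a sketch from memory, the numbers are placeholders):

from ray.util.multiprocessing import Pool

# Each of the 2 pool actors asks for half a GPU: 2 * 0.5 = 1 GPU in total,
# so in my runs both actors ended up on GPU 0 and GPU 1 was never used.
pool = Pool(processes=2, ray_remote_args={"num_gpus": 0.5})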

I feel like there should be a more “ray” way of doing this instead of hacking around with environment variables.

Ah I see, so you want all the actors to be able to see all the resources and manage the resources themselves? I don’t think this is currently possible with Ray’s resource model, where Ray only schedules based on the number of GPUs available.

A couple of follow up questions from me:

  1. So what you really want to achieve is some sort of GPU scheduling based on a load metric? Would you mind sharing a bit about how the current workload affects your scheduling?

  2. If you have 16 processes and 2 GPUs, will it work if you simply specify num_gpus=1/8 for each actor and let Ray decide the actor scheduling? (See the sketch below.)
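
For question 2, I mean something along these lines (just a sketch with placeholder numbers: 16 pool workers sharing the 2 GPUs, 1/8 GPU each, assuming there are enough CPUs for 16 actors):

from ray.util.multiprocessing import Pool

# 16 pool actors, each requesting 1/8 of a GPU: 16 * 0.125 = 2 GPUs in total,
# so Ray has to place 8 workers on each of the two physical GPUs.
pool = Pool(processes=16, ray_remote_args={"num_gpus": 1 / 8})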

Seems like that’s the problem. But why does it work for remote functions then in my example? Shouldn’t it also try to schedule more GPUs than are available?

The remote function works because the tasks are scheduled one at a time: each call to use_gpu_remote gets the 2 GPUs, runs the code, and finishes before the next one is scheduled. So use_gpu_remote(0), use_gpu_remote(1), …, use_gpu_remote(5) run serially, and the highest GPU count needed at any point is only 2.

However, with the actor pool, Ray actually tries to reserve 2 actors, each with 2 GPUs, so it now needs 4 GPUs to run them.
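
To make the difference concrete, here’s a sketch (same 2-CPU, 2-GPU setup as your repro; the processes=1 pool at the end is only for illustration, since it gives up the parallelism you want):

import ray
from ray.util.multiprocessing import Pool

ray.init(num_cpus=2, num_gpus=2)

# Remote task: each call asks for 2 GPUs, but the calls run one after
# another, so at most 2 GPUs are in use at any time -> schedulable.
@ray.remote(num_gpus=2)
def task(x):
    return x

ray.get([task.remote(x) for x in range(6)])

# Actor pool: both pool actors are created up front and each one reserves
# 2 GPUs -> 4 GPUs requested in total on a 2-GPU machine -> stuck.
# pool = Pool(processes=2, ray_remote_args={"num_gpus": 2})

# A pool that does fit reserves the 2 GPUs only once, in a single actor:
pool = Pool(processes=1, ray_remote_args={"num_gpus": 2})
print(pool.map(lambda x: x, range(6)))
pool.close()
pool.join()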

But yeah, I could see where the confusion is - we should definitely improve the documentation for the Pool. Opening an issue to track here.


Thank you so much for helping me understand this better Ricky!

Sure, I’ll just show you the snippet (using pynvml to read out the stats):

import os

from pynvml.smi import nvidia_smi

os.environ["CUDA_VISIBLE_DEVICES"] = "0, 1"

nvsmi = nvidia_smi.getInstance()
# Get the available memory for each GPU
all_stats = nvsmi.DeviceQuery("memory.free")["gpu"]
# Keep only the GPUs that were specified in the config, keyed by their index
# (run_config and set_cuda_configuration come from elsewhere in my code)
selected_stats = {i: all_stats[i]["fb_memory_usage"]["free"] for i in set(run_config.gpus)}
# Select the GPU with the maximum free memory available
gpu = max(selected_stats, key=selected_stats.get)  # type: ignore[arg-type]
device = set_cuda_configuration(gpu)

The reasoning behind the code above is that both of our GPUs are also accessible to other users and I don’t want to always check manually which one I should use for my code. Loads might also change during execution of the script, which may take a long time to run if there are many experiments (a week or longer).
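
Inside the pool function, the whole workaround currently looks roughly like this (a simplified sketch with an illustrative function name; the torch.device call stands in for my set_cuda_configuration helper, and the per-config GPU filtering is left out):

import os

from pynvml.smi import nvidia_smi
import torch

def train_one_config(config):
    # The hack: make both GPUs visible again inside the Ray worker.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0, 1"

    # Pick the GPU with the most free memory right now.
    nvsmi = nvidia_smi.getInstance()
    stats = nvsmi.DeviceQuery("memory.free")["gpu"]
    gpu = max(range(len(stats)), key=lambda i: stats[i]["fb_memory_usage"]["free"])

    device = torch.device(f"cuda:{gpu}")
    # ... build the model from `config` and train it on `device` ...
    return gpu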

AFAIK, with num_gpus=1/8 Ray would then distribute the processes evenly among both GPUs. Or is there some more elaborate scheduling going on in the background?

If the processes were distributed evenly, that’s not what I want: someone else might be using GPU 0 heavily, for instance, in which case all my stuff could just as well run on GPU 1.

Thank you very much for your elaboration. That explains a lot.

I realize that mine is probably an edge case, so I really appreciate you taking the time to clear this up. There’s probably nothing more I can do than run the code where I set the environment variable explicitly, right?