Automatic calculation of a value for the `num_gpu` param

ravi · October 11, 2022, 3:57am

How severely does this issue affect your experience of using Ray?

None: Just asking a question out of curiosity

Let me start this post by saying GPU memory is more precious than anything else in the world! Next, this question is a continuation of my previous question. Please see my simple actor below:

import ray
import torch

ray.init()
ray.cluster_resources() 

@ray.remote(num_gpus=0.5)
class Counter(object):
    def __init__(self):
        self.tensor = torch.ones((1, 3))
        self.device = "cuda:0"

    def move_and_increment(self):
        # .to() returns a new tensor rather than moving in place, so reassign it
        self.tensor = self.tensor.to(self.device)
        self.tensor += 1

    def print(self):
        return self.tensor


print(f"torch.cuda.is_available(): {torch.cuda.is_available()}")

counters = [Counter.remote() for i in range(2)]
[c.move_and_increment.remote() for c in counters]
futures = [c.print.remote() for c in counters]
print(ray.get(futures))

ray.shutdown()

I have one Nvidia GeForce RTX 2080 (8 GB of memory) and the above code works fine on it. However, please notice the num_gpus=0.5 parameter in my actor. I have the following two questions about the num_gpus parameter:

1. In my simple program, the actor and the main function are in the same file, and there are only a handful of actors. Both of these make it very easy to update the num_gpus parameter. But how do you edit this parameter (and others, say num_cpus, etc.) in a large project having multiple files?
2. Consider an RTX 3090 with 24 GB of GPU memory and a tiny tensor. Suppose I use num_gpus=1 (instead of 0.5) and run two actors. Shouldn’t Ray automatically find free memory on the GPU and then allocate the second actor to the same GPU to save resources? That way, I could run a large number of actors on one GPU.

In summary, is there a way to automatically calculate a value for the num_gpus parameter?

ravi · October 15, 2022, 2:28pm

Any suggestions, please?

Chen_Shen · October 15, 2022, 6:11pm

hi Ravi,

Unfortunately, Ray today cannot automatically calculate num_gpus for each task. There has been some work going on to avoid task memory OOM, but there is no GPU memory support yet.

To override the GPU requirement, you can call .options() on the task or actor, following this guide: Miscellaneous Topics — Ray 2.0.0

matthewdeng · October 15, 2022, 10:17pm

Hey Ravi, a few more suggestions:

But how do you edit this parameter (and others, say num_cpus, etc.) in a large project having multiple files?

If I understand your question correctly, you can dynamically override the resources a task/actor requires by calling .options().

As an example, your script can be updated to launch 5 counters that each require 2 CPUs and 0.2 GPUs by changing the following:

- counters = [Counter.remote() for i in range(2)]
+ counters = [Counter.options(num_cpus=2, num_gpus=0.2).remote() for i in range(5)]
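For a larger project with many files, one possible way to keep these values editable in a single place is a small shared module of resource profiles that every call site passes to .options(). This is only a sketch: the module name, the profile names, and the helpers `resources_for` and `launch` are hypothetical and not part of Ray; the only Ray API assumed is `.options(...).remote()` as shown above.

```python
# resources.py (hypothetical module name): one place to tune per-role resources.
from typing import Dict

# Hypothetical role names; the values are keyword arguments passed to .options().
RESOURCE_PROFILES: Dict[str, Dict[str, float]] = {
    "counter": {"num_cpus": 1, "num_gpus": 0.5},
    "trainer": {"num_cpus": 4, "num_gpus": 1},
}


def resources_for(role: str, **overrides: float) -> Dict[str, float]:
    """Return a copy of the profile for `role`, with optional per-call overrides."""
    if role not in RESOURCE_PROFILES:
        raise KeyError(f"unknown resource profile: {role!r}")
    profile = dict(RESOURCE_PROFILES[role])  # copy so the shared table is not mutated
    profile.update(overrides)
    return profile


def launch(actor_cls, role: str, count: int, **overrides: float):
    """Create `count` actors from a @ray.remote class using the role's profile."""
    opts = resources_for(role, **overrides)
    return [actor_cls.options(**opts).remote() for _ in range(count)]
```

With the Counter actor from the original post, `launch(Counter, "counter", 5, num_gpus=0.2)` would correspond to the one-line change above, while the defaults stay in one file.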

Shouldn’t ray automatically find free memory on the GPU and then allocate the second actor to the same GPU to save resources?

This is a bit complicated since Ray isn’t aware of how much GPU memory is available on the GPU or how much each task/actor needs. If you are able to estimate this ahead of time, you can do something like:

gpu_fraction = estimated_actor_gpu_memory/single_gpu_memory
counters = [Counter.options(num_gpus=gpu_fraction).remote() for i in range(num_counters)]
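The estimate above can be turned into a small helper. This is a minimal sketch under stated assumptions: the function names and the `reserve_bytes` headroom parameter are hypothetical, and rounding up to four decimal places is an assumption about a sensible granularity for fractional GPUs. Note that num_gpus is only a logical scheduling quantity; Ray does not measure or enforce GPU memory, so the actors must stay within the estimate themselves.

```python
GRANULARITY = 10_000  # assumption: round fractions up to 4 decimal places


def gpu_fraction(estimated_actor_bytes: int, single_gpu_bytes: int,
                 reserve_bytes: int = 0) -> float:
    """Fraction of one GPU to request, rounded up so actors are not over-packed."""
    if estimated_actor_bytes <= 0 or single_gpu_bytes <= 0 or reserve_bytes < 0:
        raise ValueError("sizes must be positive and the reserve non-negative")
    needed = estimated_actor_bytes + reserve_bytes
    if needed > single_gpu_bytes:
        raise ValueError("estimated footprint does not fit on one GPU")
    units = -(-needed * GRANULARITY // single_gpu_bytes)  # integer ceiling division
    return units / GRANULARITY


def actors_per_gpu(fraction: float) -> int:
    """How many actors requesting `fraction` of a GPU fit on a single GPU."""
    return GRANULARITY // round(fraction * GRANULARITY)


def total_gpu_memory_bytes(device_index: int = 0) -> int:
    """Total memory of one CUDA device, read through PyTorch (imported lazily)."""
    import torch
    return torch.cuda.get_device_properties(device_index).total_memory
```

For example, `fraction = gpu_fraction(estimated_actor_gpu_memory, total_gpu_memory_bytes())` could be passed as `Counter.options(num_gpus=fraction)`. The per-actor estimate still has to come from somewhere, for instance from `torch.cuda.max_memory_allocated()` after a trial run, and should include headroom for the CUDA context.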

zhz · December 2, 2022, 6:14am

Thanks @ravi for raising the question!

It looks to me like the dynamic resource setting option that @Chen_Shen and @matthewdeng suggested is the way to go for your use case. Let us know if it works.
