Automatic calculation of a value for the `num_gpus` param

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

Let me start this post by saying GPU memory is more precious than anything else in the world! Next, this question is a continuation of my previous question. Please see my simple actor below:

import ray
import torch

ray.init()
print(ray.cluster_resources())  # show the CPUs/GPUs Ray detected

@ray.remote(num_gpus=0.5)
class Counter(object):
    def __init__(self):
        self.tensor = torch.ones((1, 3))
        self.device = "cuda:0"

    def move_and_increment(self):
        # .to() returns a new tensor instead of moving it in place, so reassign.
        self.tensor = self.tensor.to(self.device)
        self.tensor += 1

    def print(self):
        # Return a CPU copy so the driver can deserialize it without a GPU.
        return self.tensor.cpu()


print(f"torch.cuda.is_available(): {torch.cuda.is_available()}")

counters = [Counter.remote() for i in range(2)]
[c.move_and_increment.remote() for c in counters]
futures = [c.print.remote() for c in counters]
print(ray.get(futures))

ray.shutdown()

I have one Nvidia GeForce RTX 2080 (8 GB memory), and the above code works fine on it. However, please notice the num_gpus=0.5 parameter in my actor. I have the following two questions about the num_gpus parameter:

  1. In my simple program, the actor and the main function are in the same file, and there are only a handful of actors, so it is easy to update the num_gpus parameter. But how do you edit this parameter (and others, say num_cpus, etc.) in a large project having multiple files?
  2. Consider an RTX 3090 with 24 GB of GPU memory and a tiny tensor. In this case, if I use num_gpus=1 (instead of 0.5) and run two actors, shouldn’t Ray automatically find free memory on the GPU and then allocate the second actor to the same GPU to save resources? That way, I could run a large number of actors on one GPU.

In summary, is there a way to automatically calculate a value for the num_gpus parameter?

Any suggestions, please?

Hi Ravi,

Unfortunately, Ray does not currently have the capability to automatically calculate num_gpus for each task. There has been some work on avoiding task memory OOM, but there is no GPU memory support yet.

To override the GPU requirement, you can use .options(), following this guide: Miscellaneous Topics — Ray 2.0.0

Hey Ravi, a few more suggestions:

But how do you edit this parameter (and others, say num_cpus, etc.) in a large project having multiple files?

If I understand your question correctly, you can dynamically override the resources a task/actor requires by calling .options().

As an example, your script can be updated to launch 5 counters that each require 2 CPUs and 0.2 GPUs by changing the following:

- counters = [Counter.remote() for i in range(2)]
+ counters = [Counter.options(num_cpus=2, num_gpus=0.2).remote() for i in range(5)]
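For a project spread over many files, one possible pattern (a sketch, not something Ray provides) is to keep the resource numbers in a single module and pass them to .options() at each call site. The module name resources.py, the ACTOR_RESOURCES table, and the resources_for helper below are all hypothetical names made up for illustration; only the num_cpus and num_gpus keywords come from this thread.

```python
# resources.py (hypothetical module): one place to edit resource requests.

# Hypothetical per-actor settings, keyed by actor class name.
ACTOR_RESOURCES = {
    "Counter": {"num_cpus": 2, "num_gpus": 0.2},
}

# Hypothetical fallback for actors that have no entry above.
DEFAULT_RESOURCES = {"num_cpus": 1, "num_gpus": 0}


def resources_for(actor_name, overrides=None):
    """Return a new dict of keyword arguments for .options()."""
    merged = {**DEFAULT_RESOURCES, **ACTOR_RESOURCES.get(actor_name, {})}
    if overrides:
        merged.update(overrides)
    return merged


# Intended use elsewhere in the project (requires Ray and the Counter actor):
#   from resources import resources_for
#   counters = [Counter.options(**resources_for("Counter")).remote()
#               for _ in range(5)]
```

With this layout, changing the GPU share for every Counter in the project means editing one line in the table rather than every decorator.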

Shouldn’t Ray automatically find free memory on the GPU and then allocate the second actor to the same GPU to save resources?

This is a bit complicated since Ray isn’t aware of how much GPU memory is available on the GPU or how much each task/actor needs. If you are able to estimate this ahead of time, you can do something like:

gpu_fraction = estimated_actor_gpu_memory/single_gpu_memory
counters = [Counter.options(num_gpus=gpu_fraction).remote() for i in range(num_counters)]
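To make the estimate above a bit more concrete, here is a minimal sketch of how gpu_fraction could be computed. The helper name, the headroom parameter, and the rounding to three decimals are my own assumptions, not Ray features. The byte counts have to come from your own measurements, for example torch.cuda.max_memory_allocated() after a trial run and torch.cuda.get_device_properties(0).total_memory for the card.

```python
import math


def gpu_fraction(estimated_actor_bytes, total_gpu_bytes, headroom=0.1):
    """Estimate a num_gpus value from memory figures (hypothetical helper).

    headroom is the share of GPU memory kept free for things the estimate
    misses (CUDA context, allocator fragmentation); 0.1 is an arbitrary guess.
    """
    if estimated_actor_bytes <= 0 or total_gpu_bytes <= 0:
        raise ValueError("memory sizes must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")

    usable_bytes = total_gpu_bytes * (1 - headroom)
    fraction = estimated_actor_bytes / usable_bytes
    if fraction > 1:
        raise ValueError("the actor does not fit on a single GPU")

    # Round up so that the actors packed onto one GPU never exceed the budget.
    return math.ceil(fraction * 1000) / 1000


GIB = 1024 ** 3
# A 2 GiB actor on a 24 GiB card, with no headroom reserved.
print(gpu_fraction(2 * GIB, 24 * GIB, headroom=0.0))
```

Keep in mind that num_gpus is only a scheduling hint: Ray uses it to decide how many actors share a GPU, but it does not limit how much GPU memory each actor actually allocates, so a low estimate can still end in an out-of-memory error.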

Thanks @ravi for raising the question!

It looks to me like the dynamic resource setting that @Chen_Shen and @matthewdeng suggested is the direction to go for your use case. Let us know if it works.