GPU Memory Aware Scheduling

Hi, I searched the web but couldn’t find an answer to my question, so I’m posting it here.

I’m planning to use Ray to deploy trained models in production pipelines. I know the upper bound of the GPU memory that my trained models will consume during inference.
Is it possible to specify the “GPU memory requirements” beforehand in @ray.remote in order to maximize GPU utilization and throughput?

In case you’re going to suggest num_gpus: I have a couple of GPU specifications. For instance, there is a system with 2 GPUs of 8GB each, one with 1 GPU of 24GB, etc.

If inference is going to take, let’s say, 4 GB of GPU memory, how do I specify that beforehand (GPU memory aware scheduling)?


Could you use custom resources?
You could add, e.g., GPUMemory as a custom resource when starting up the cluster. Then you can use it by stating how much GPU memory your tasks will take when running.
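
As a rough sketch of that idea, the snippet below derives a per-node GPUMemory value from the GPU sizes mentioned in the question and prints a matching `ray start` command line. The helper names (`gpu_memory_resource`, `ray_start_command`) and the node names are hypothetical, and the unit (GB) is an arbitrary choice. Ray treats GPUMemory as an opaque number for bookkeeping; it does not measure or enforce real memory use.

```python
import json

# Hypothetical node inventory based on the question: GPU sizes in GB per machine.
NODES = {
    "node-a": [8, 8],  # 2 GPUs of 8 GB each
    "node-b": [24],    # 1 GPU of 24 GB
}


def gpu_memory_resource(gpu_sizes_gb):
    """Return the custom-resource dict a node would advertise (unit: GB)."""
    return {"GPUMemory": sum(gpu_sizes_gb)}


def ray_start_command(gpu_sizes_gb):
    """Build a `ray start` command line advertising GPUs and GPUMemory.

    Append --head or --address=... depending on the node's role.
    """
    resources = json.dumps(gpu_memory_resource(gpu_sizes_gb))
    return f"ray start --num-gpus={len(gpu_sizes_gb)} --resources='{resources}'"


for name, sizes in NODES.items():
    print(name, "->", ray_start_command(sizes))
```

A task would then claim its share by passing `resources={"GPUMemory": 4}` to @ray.remote.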

I’m not aware of something similar being available out of the box.

Custom resources are definitely helpful in this regard. Thank you. But if I specify only custom resources, my inference does not run on the GPU and crashes with “no CUDA capable device detected.”

That is strange. If none of your non-GPU nodes have the resource, the tasks should never be scheduled there. However, I think it makes sense to specify both the custom resource and the GPU resource: no matter how much memory a task takes, it always occupies some part of the GPU as well, so Ray should know that fewer GPUs are available because some are already in use. For example, if another task specifies only the GPU and not the GPU memory, it might otherwise be scheduled on a node that is already full.

Do you have an example of how you do that?

Let’s say I have 2 GPUs with the following specs:

  1. 8GB memory
  2. 24GB memory

I have a trained model which takes at most 4 GB of GPU memory during inference.

Since the infer function needs to run on the GPU, I must declare “num_gpus”. If I do not declare it, the process does not run on the GPU and fails with “no CUDA capable device detected.”

In practice, I have 32 GB of GPU memory in total, which means I can run 8 inferences in parallel.

What should the @ray.remote decorator for such a function look like?
If I specify it as:

@ray.remote(num_gpus=0.5, resources={"GPUMemory": 4})
def infer(…):

It will be a best-fit case for the 8 GB GPU, but my other 24 GB GPU will be underutilized, as this will allocate 12 GB (half of 24 GB, as per num_gpus=0.5) to a process that is going to use just 4 GB.
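
The arithmetic behind this concern can be checked with a small sketch. It assumes the two GPUs and the 4 GB model from this post; the function names are made up for illustration, and the calculation models only the bookkeeping, not Ray's actual scheduler.

```python
import math

GPU_SIZES_GB = [8, 24]  # the two GPUs described above
MODEL_GB = 4            # upper bound on inference memory per task


def slots_by_fraction(num_gpus_per_task):
    """Concurrent tasks per GPU when each task reserves a GPU fraction."""
    # The small epsilon guards against floating-point rounding in the division.
    return math.floor(1 / num_gpus_per_task + 1e-9)


def slots_by_memory(gpu_gb, model_gb):
    """Concurrent tasks that fit on one GPU by memory alone."""
    return gpu_gb // model_gb


for gpu_gb in GPU_SIZES_GB:
    by_fraction = slots_by_fraction(0.5)
    by_memory = slots_by_memory(gpu_gb, MODEL_GB)
    idle_gb = gpu_gb - by_fraction * MODEL_GB
    print(f"{gpu_gb} GB GPU: {by_fraction} tasks at num_gpus=0.5, "
          f"{by_memory} tasks by memory, {idle_gb} GB left idle")
```

With these numbers, num_gpus=0.5 allows 4 concurrent tasks in total (2 per GPU), while counting by memory allows 8 (2 on the 8 GB GPU and 6 on the 24 GB GPU).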

I hope that explains the problem.


Can you tell me how you are starting the nodes? The important part here is how you configure the custom resources there.

If you do not care about GPU usage at all (which I assume is normally more important for training than for inference), then you could also use @ray.remote(num_gpus=0.01, resources={"GPUMemory": 4}).

I have never tested using custom resources without standard resources like GPU, but referring to “Using Ray with GPUs”, I assume you only get “no CUDA capable device detected” because Ray hides the GPUs. So you could set CUDA_VISIBLE_DEVICES yourself, and your code should be able to run again.
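
Setting the variable yourself could look roughly like the sketch below. `select_gpu` is a hypothetical helper, and choosing the device index is left to the caller. The variable has to be set before the deep-learning framework initializes CUDA in that process.

```python
import os


def select_gpu(device_index):
    """Expose a single GPU to CUDA libraries loaded later in this process."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Call this at the top of the task body, before importing or initializing
# the framework that talks to CUDA.
print("CUDA_VISIBLE_DEVICES =", select_gpu(0))
```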

But I would recommend using a small GPU fraction instead and relying completely on GPUMemory-aware scheduling; it is just easier :smiley:
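
To illustrate why a small GPU fraction lets GPUMemory become the limiting resource, here is a toy admission model. It is not Ray's scheduler: the `Node` class and its `try_admit` method are hypothetical, and the model only mimics the rule that a task starts when every requested resource is still available on the node. It also tracks memory per node, not per GPU.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model of one node's free logical resources (not Ray's scheduler)."""
    free: dict = field(default_factory=dict)

    def try_admit(self, request):
        """Reserve the request if every resource fits; return True on success."""
        if any(self.free.get(name, 0) < amount for name, amount in request.items()):
            return False
        for name, amount in request.items():
            self.free[name] -= amount
        return True


# Assumed setup: one node with 2 GPUs and 32 GB of GPU memory in total.
node = Node(free={"GPU": 2.0, "GPUMemory": 32})
request = {"GPU": 0.01, "GPUMemory": 4}  # mirrors num_gpus=0.01, GPUMemory=4

# Submit 10 tasks; only those that fit in the remaining resources are admitted.
admitted = sum(node.try_admit(request) for _ in range(10))
print(f"admitted {admitted} of 10 tasks")
```

In this model 8 of the 10 tasks are admitted: the GPU fraction is far from exhausted, so the 32 units of GPUMemory decide how many tasks run at once.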

@eoakes I’m not sure who is involved in GPU-aware scheduling or whether there is a better solution at the moment. However, if this is not yet possible, maybe it makes sense to bring it up for discussion?

We also have the case that our deployments are more sensitive to GPU memory than to GPU usage, and a percentage is not the best solution because, e.g., pipelines and test environments often have different resources than production environments. I could imagine @rohitgpt is not the only person facing this :smiley:

Another possibility (similar to custom resources) is to take advantage of the automatically added “AcceleratorType” resource. The autoscaler automatically detects several types of GPUs (e.g., V100) and tags them as “AcceleratorType:V100”.

So you could request half of a V100 GPU as follows:

@ray.remote(num_gpus=0.5, resources={"AcceleratorType:V100": 0.01})
def f():

This isn’t as flexible as the custom resources though, since it will require a specific GPU type.