GPU Memory Aware Scheduling

Hi, I searched the web but couldn’t find an answer to my query, so I’m posting it here.

I’m planning on using Ray to deploy trained models in production pipelines. I know the upper bound of the GPU memory that my trained models are going to consume during inference.
Is it possible to specify the “GPU memory requirements” beforehand in @ray.remote in order to maximize GPU utilization and throughput?

In case you’re going to suggest num_gpus: I have several different GPU configurations. For instance, there is a system with 2 GPUs of 8GB each, one with 1 GPU of 24GB, etc.

If inference is going to take, let’s say, 4GB of GPU memory, how do I specify that beforehand (GPU-memory-aware scheduling)?

TIA.

Would it be possible for you to use custom resources?
You could add, e.g., GPUMemory as a custom resource when starting up the cluster. Then you can use it by stating how much GPU memory your tasks will take when running; see the sketch below.

I’m not aware of something similar being available out of the box.
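Something along these lines, as a minimal sketch (GPUMemory is an arbitrary name I’m making up, the unit is GB by convention, and the infer function plus the 8 GB / 4 GB numbers are just placeholders):

# Sketch: declare the node's total GPU memory (in GB) as a custom resource
# when starting it, e.g. on a node with a single 8 GB GPU:
#
#     ray start --head --resources='{"GPUMemory": 8}'
#
# Tasks can then reserve a slice of that budget.
import ray

ray.init(address="auto")  # connect to the already-running cluster

@ray.remote(resources={"GPUMemory": 4})  # reserve 4 GB of the custom resource
def infer(batch):
    ...  # load the model and run inference here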

Custom resources are definitely helpful in this regard, thank you. But if I specify only the custom resource, then my inference does not run on the GPU and it crashes with “no CUDA capable device detected.”

That is strange. If all your non-GPU nodes lack the custom resource, the tasks should never be scheduled there. However, I think it makes sense to specify both the custom resource and the GPU resource: no matter how much memory a task takes, it still occupies part of a GPU, and it is worth telling Ray that fewer GPUs are available because some are already in use. E.g., if another task requests only num_gpus and not GPU memory, it might otherwise be scheduled on a node that is already full.

Do you have an example of how you do that?

Let’s say I have 2 GPUs with the following specs:

  1. 8GB memory
  2. 24GB memory

I have a trained model which takes at most 4 GB of GPU memory during inference.

Since the infer function needs to be executed on the GPU, I must declare num_gpus. If I do not declare it, the process does not run on the GPU and gives “no CUDA capable device detected.”

In total, I have 32 GB of GPU memory across both GPUs, which means I should be able to run 8 such inferences in parallel.

What should the @ray.remote decorator for such a function look like?
If I specify it as:

@ray.remote(num_gpus=0.5, resources={"GPUMemory": 4})
def infer(...):

It will be a best fit for the 8GB GPU, but my other 24GB GPU will be underutilized, as this allocates 12GB (half of 24GB, per num_gpus=0.5) to a process that is only going to take 4GB.

Hope I was able to explain it.

TIA.

Can you tell me how you are starting the nodes? The important part here is how you configure the custom resources there.

If you don’t care about GPU compute utilization at all (which I assume is normally more important for training than for inference), then you could also say @ray.remote(num_gpus=0.01, resources={"GPUMemory": 4}).

I never tested using custom resources without standard resources like gpu, but referring to this Using ray with gpus, I assume you only get the “no CUDA capable device detected” error because Ray hides the devices, so you could set CUDA_VISIBLE_DEVICES yourself and your code should run again.
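A rough sketch of that workaround (assuming a single GPU per node, so device 0 is the right one; with several GPUs per node you would need your own bookkeeping to pick a free device):

import os
import ray

ray.init(address="auto")

# Without a num_gpus request, Ray sets CUDA_VISIBLE_DEVICES to an empty
# string for the task, which is why CUDA reports no device. Overriding it
# inside the task makes the GPU visible again.
@ray.remote(resources={"GPUMemory": 4})
def infer(batch):
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # assumption: use the first GPU
    ...  # run inference on the GPU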

But I would recommend using a small GPU fraction instead and relying completely on GPUMemory-aware scheduling; it is just easier :smiley:
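Concretely, something like this (a sketch only; it assumes the 8GB and 24GB GPUs sit on separate worker nodes, and <head-address> is a placeholder for your head node’s address):

# On the node with the 8 GB GPU:
#     ray start --address=<head-address> --num-gpus=1 --resources='{"GPUMemory": 8}'
# On the node with the 24 GB GPU:
#     ray start --address=<head-address> --num-gpus=1 --resources='{"GPUMemory": 24}'

import ray

ray.init(address="auto")

# The tiny num_gpus keeps CUDA devices visible to the task, while GPUMemory
# does the actual packing: 2 concurrent tasks fit on the 8 GB node and
# 6 on the 24 GB node, instead of best-fitting around num_gpus=0.5.
@ray.remote(num_gpus=0.01, resources={"GPUMemory": 4})
def infer(batch):
    ...  # run the 4 GB inference here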


@eoakes I’m not sure who is involved in GPU-aware scheduling and whether there is a better solution at the moment. However, if this is not yet possible, maybe it makes sense to bring it up for discussion?

We also have the case that our deployments are more sensitive to GPU memory than to GPU utilization, and a percentage is not the best solution, since e.g. pipelines/test environments often have different resources than production environments. I can imagine @rohitgpt is not the only person facing this :smiley:


Another possibility (similar to custom resources) is to take advantage of the automatically added “AcceleratorType” resource. The autoscaler automatically detects several types of GPUs (e.g., V100) and tags nodes with “AcceleratorType:V100”.

So you could request half of a V100 GPU as follows:

@ray.remote(num_gpus=0.5, resources={"AcceleratorType:V100": 0.01})
def f():
    pass

This isn’t as flexible as custom resources, though, since it requires a specific GPU type.

Thank you for your response. I’m a bit of a latecomer…

I have the same problem as in the discussion. I have multiple nodes with multiple GPUs of different sizes.

In that case I need to specify GPUMemory: 16 on a node that has 2 GPUs of 8 GB each, and GPUMemory: 24 on the other node, which has 2 GPUs of 12 GB each.
I assume there will be a problem if a task requests GPUMemory: 10, because from Ray’s side it will look like there is enough memory on both the first and the second node, even though no single GPU on either node has 10 GB available.

Could you please advise me how to deal with this case?