Automatically choose the most free GPU

Ilnur786 · August 14, 2023, 11:25am

I have 2 GPUs on the machine. How can I choose the most free GPU for each run? I wrapped the predict function with the @ray.remote(num_gpus=1, num_cpus=8) decorator and wrote a function that finds the most free GPU and sets it through os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id). When the most free GPU changes and a new model instance is loaded on the other GPU, Ray releases the model instances on the previous GPU. How can I solve this problem?

Update: by default, Ray chooses the GPU with id 0, even when CUDA_VISIBLE_DEVICES=0,1 is set and the decorator is @ray.remote(num_gpus=2, num_cpus=8, max_calls=1).

XIE · August 15, 2023, 5:33am

cc: @yic could you take a quick look?

Ilnur786 · August 15, 2023, 4:18pm

I should add that I’m doing this on an AWS EC2 machine, and the model is one of the Hugging Face transformers.

yic · August 17, 2023, 10:28pm

@Ilnur786 could you give me a script that shows what you mean by the most free GPU? IIUC, Ray only treats a GPU as a logical resource and doesn’t check which GPU is ‘most free’.

It’d be nice if you could share a script showing what’s going wrong and what’s expected.

Ilnur786 · August 21, 2023, 4:08pm

Hello. I get the free memory of each GPU with this function, which returns a List[int]:

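import subprocess as sp

def get_gpu_memory():
    # Query free memory per GPU; the first output line is the CSV header.
    command = "nvidia-smi --query-gpu=memory.free --format=csv"
    memory_free_info = sp.check_output(command.split()).decode('ascii').split('\n')[:-1][1:]
    # Each remaining row looks like "1234 MiB"; keep only the number.
    memory_free_values = [int(x.split()[0]) for x in memory_free_info]
    return memory_free_values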

Then I set the GPU index with os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id). It worked, but when some tasks were already running on gpu:0 and the next job had to run on gpu:1 (because it was the most free GPU at that moment), Ray released the resources on gpu:0, which killed the tasks running there.
I chose Ray because I wasn’t able to release GPU resources after running a Hugging Face transformer, but unfortunately Ray doesn’t let me specify the exact GPU index.
I solved the problem of resources not being released after a task finishes by running the task in a separate process with multiprocessing. Fortunately, the Hugging Face model lets you choose the GPU index.
P.S. I tried using Ray and passing the GPU index to the model, but that setup didn’t work.
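
For future readers, here is a minimal sketch of the multiprocessing approach described above. It is not the exact code from this post. The names parse_free_memory, pick_most_free_gpu, _worker and run_on_gpu are hypothetical, and the model loading step is left as a placeholder. It assumes the nvidia-smi CSV output has one header line followed by rows such as "1234 MiB".

```python
import multiprocessing as mp
import os


def parse_free_memory(csv_text):
    """Parse `nvidia-smi --query-gpu=memory.free --format=csv` output into MiB values."""
    rows = [line.strip() for line in csv_text.splitlines() if line.strip()]
    # rows[0] is the header; data rows are assumed to look like "1234 MiB".
    return [int(row.split()[0]) for row in rows[1:]]


def pick_most_free_gpu(free_mib):
    """Return the index of the GPU with the most free memory (first one on ties)."""
    if not free_mib:
        raise ValueError("no GPUs reported")
    return max(range(len(free_mib)), key=free_mib.__getitem__)


def _worker(gpu_id, result_queue):
    # Set the variable before any CUDA library is imported in this process,
    # so the framework only ever sees the chosen device.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # Hypothetical placeholder: load the model and run the prediction here.
    result_queue.put(os.environ["CUDA_VISIBLE_DEVICES"])


def run_on_gpu(gpu_id):
    """Run the worker in a fresh process; its GPU memory is freed when it exits."""
    ctx = mp.get_context("spawn")
    result_queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(gpu_id, result_queue))
    proc.start()
    result = result_queue.get()
    proc.join()
    return result
```

Call run_on_gpu from under an `if __name__ == "__main__":` guard, since the "spawn" start method re-imports the main module in the child. Inside the child, the selected GPU is the only visible device, so the framework will see it as device 0 regardless of its physical index.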

Ilnur786 · August 29, 2023, 10:47am

Some updates for future visitors: I changed the model to a pure torch one and hit the same issue. It may be caused by either the task management system (dramatiq in my case) or the Amazon EC2 machine. In most other cases, I think it should be possible to release resources with the standard methods: move the model and other tensors to the CPU, delete the variables, and clear the torch cache. If that doesn’t work, use Ray or run the model computation in a separate process.
