I have 2 GPUs on the machine. How can I choose the most free GPU for each run? I wrapped the predict function with the @ray.remote(num_gpus=1, num_cpus=8) decorator and wrote a function that finds the most free GPU and sets it through os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id). When the most free GPU changes and a new model instance is loaded on the other GPU, Ray releases the model instances on the previous GPU. How can I solve this problem?
Update: by default, Ray chooses the GPU with id 0, even when CUDA_VISIBLE_DEVICES=0,1 is set and the decorator is @ray.remote(num_gpus=2, num_cpus=8, max_calls=1).
cc: @yic could you take a quick look?
I should add that I'm doing this on an AWS EC2 machine, and the model is one of the Hugging Face transformers.
@Ilnur786 could you give me a script that shows what you mean by the most free GPU? IIUC, Ray only treats a GPU as a logical resource and doesn't check which GPU is 'most free'.
It'll be nice if you can share a script showing what's going wrong and what's expected.
Hello. I'm getting the free memory of each GPU with this function, which returns List[int]:
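import subprocess as sp
command = "nvidia-smi --query-gpu=memory.free --format=csv"
# Drop the CSV header line and the empty string left by the trailing newline.
memory_free_info = sp.check_output(command.split()).decode('ascii').split('\n')[:-1][1:]
# Each remaining line looks like "15109 MiB"; keep only the number.
memory_free_values = [int(x.split()[0]) for x in memory_free_info]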
Then I set the GPU index with os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id). It worked, but when some tasks were already running on gpu:0 and the next job should run on gpu:1 (because it was the most free GPU at that moment), Ray released the resources on gpu:0, which killed the tasks on it.
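For future readers, here is a minimal sketch of the "most free GPU" lookup described above. It is not the original function: the names parse_free_memory and most_free_gpu are made up for illustration, and it queries index,memory.free with csv,noheader,nounits so that each output line is just "index, MiB". Parsing is kept separate from the nvidia-smi call so it can be checked without a GPU.

```python
import subprocess

# Assumed query: one "index, free_MiB" line per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_free_memory(csv_text):
    """Return {gpu_index: free_MiB} parsed from nvidia-smi CSV output."""
    free = {}
    for line in csv_text.strip().splitlines():
        index, mib = (field.strip() for field in line.split(","))
        free[int(index)] = int(mib)
    return free


def most_free_gpu(csv_text=None):
    """Return the index of the GPU with the most free memory.

    If csv_text is None, nvidia-smi is called; passing text in makes the
    function usable on machines without a GPU (for example in tests).
    """
    if csv_text is None:
        csv_text = subprocess.check_output(QUERY, text=True)
    free = parse_free_memory(csv_text)
    return max(free, key=free.get)
```

The returned index would then go into os.environ['CUDA_VISIBLE_DEVICES'] as a string. CUDA reads that variable when it initializes in a process, so it has to be set before the process first touches the GPU; changing it afterwards does not move work to another device. Two runs that start at the same moment can also pick the same GPU, since the free-memory reading is only a snapshot.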
I chose Ray because I struggled to release resources after running a Hugging Face transformer, but, unfortunately, Ray doesn't give a way to specify the exact GPU index.
I solved the problem of resources not being released after a task finishes by running the task in another process with multiprocessing. Fortunately, the Hugging Face model makes it possible to choose the GPU index.
P.S. I tried to use Ray and pass the GPU index to the model, but this scheme didn't work.
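A rough sketch of the "run it in another process" idea, for anyone landing here. This is a variant that uses subprocess rather than multiprocessing, only to keep the example short and self-contained; the helper name run_on_gpu and the CHILD_CODE placeholder are hypothetical. The point it illustrates is that the child gets its own CUDA_VISIBLE_DEVICES from the start, and everything it allocated on the GPU is released when it exits.

```python
import os
import subprocess
import sys

# Placeholder child program. Real code would load the model and run the
# prediction here; this one only reports which GPU it was allowed to see.
CHILD_CODE = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"


def run_on_gpu(gpu_id, code=CHILD_CODE):
    """Run `code` in a fresh Python process restricted to one GPU.

    The variable is set in the child's environment before it starts, so it
    is in place before any CUDA library is imported there. The parent's
    environment is left untouched.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with an error
    )
    return result.stdout.strip()
```

Inside the child, the selected card is the only visible device, so it shows up as device 0 regardless of its physical index.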
Some updates for future visitors: I changed the model to a pure torch one and hit the same issue. It may be caused by either the task management system (dramatiq in my case) or the Amazon EC2 machine. In most other cases, I think it should be possible to release resources with the standard methods: move the model and other tensors to the CPU, delete the variables, and clear the torch cache. If not, use Ray or run the model computation in another process.
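The "standard methods" above could look roughly like this. It is a sketch under assumptions: torch.nn.Linear stands in for the real model, and run_and_release is a made-up name. It only works if no other reference to the model or its tensors is kept elsewhere (for example in a global or a cache), because the memory can be returned only after the last reference is gone.

```python
import gc

import torch


def run_and_release(device):
    """Run one forward pass, then drop everything that lives on `device`."""
    model = torch.nn.Linear(4, 2).to(device)  # stand-in for the real model
    x = torch.ones(1, 4, device=device)

    with torch.no_grad():
        y = model(x).cpu()  # keep only a CPU copy of the result

    # 1. Move the model to the CPU, 2. delete the variables,
    # 3. collect garbage, 4. clear the torch CUDA cache.
    model.to("cpu")
    del model, x
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return y
```

Even after this, nvidia-smi will usually still show some memory in use by the process, because the CUDA context itself stays allocated until the process exits. That is one reason running the model in a separate process is the more reliable fallback.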