How to wait for GPU memory to be released when using TensorFlow in a ray remote function

EdoCha · January 19, 2024, 3:51pm

Hello,

I’m defining a Ray remote function that does some TensorFlow work on the GPU. I saw that I need to set max_calls=1 in the function decorator so that the worker process is killed and its GPU memory released once the function completes. This works fine if I wait a bit (e.g. 1 second) after the call returns before doing more TensorFlow work that reallocates GPU memory. If I don’t wait, the worker has probably not been killed yet when I start the next step, so the memory hasn’t been released.
Waiting for an arbitrary amount of time is definitely not what I want. Instead, I want to wait until I get some kind of confirmation that the worker has been killed.
How can I do that?

Some pseudo code:

import time

import ray
import tensorflow as tf

@ray.remote(num_cpus=1, num_gpus=1, max_calls=1)
def use_tf():
    # use TensorFlow to do some work on the GPU, then return a result
    tf.model.load()  # placeholder for the real TensorFlow work
    return 1

# call the Ray remote function that uses TensorFlow and fetch its result
obj_ref = use_tf.remote()
result = ray.get(obj_ref)

# wait a bit for GPU memory to be released
# this is what I want to replace with code that waits until the worker that
# executed use_tf() has shut down and its GPU memory has been released
time.sleep(2)

# do some more TensorFlow work on the GPU in the driver
tf.model.load()  # placeholder for the real TensorFlow work

sangcho · January 25, 2024, 12:31pm

Maybe use ray.available_resources() and check whether the GPU is available again?

If you do the other TensorFlow work in another worker, you can also enforce this by requesting num_gpus=1 for that task.
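
A minimal sketch of the first suggestion, replacing the fixed sleep with a polling loop. The helper name wait_for_gpu and its parameters (min_gpus, timeout_s, poll_interval_s, get_resources) are made up for illustration; only ray.available_resources() is an actual Ray call. Note the assumption here: Ray's resource view is logical bookkeeping, so a GPU showing up as available does not by itself prove that the driver has already returned the memory. It is worth verifying on your setup.

```python
import time


def wait_for_gpu(min_gpus=1.0, timeout_s=30.0, poll_interval_s=0.1,
                 get_resources=None):
    """Poll until at least `min_gpus` GPUs are reported as available.

    Returns True once the GPUs are reported free, False on timeout.
    `get_resources` is a hypothetical hook for testing; by default the
    helper calls ray.available_resources().
    """
    if get_resources is None:
        import ray  # imported lazily so the helper can be tested without a cluster
        get_resources = ray.available_resources

    deadline = time.monotonic() + timeout_s
    while True:
        # A resource that is fully in use may be missing from the dict,
        # so a missing "GPU" key is treated as zero.
        if get_resources().get("GPU", 0.0) >= min_gpus:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

In the pseudocode from the question, the time.sleep(2) line would become something like `if not wait_for_gpu(): raise RuntimeError("GPU was not released in time")`. The timeout keeps the driver from hanging forever if the GPU never comes back.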
