I’m defining a Ray remote function that does some TensorFlow work on the GPU. I saw that I need to set max_calls=1 in the function decorator so that the worker process is killed and its GPU memory released once the function completes. This works fine if I wait a bit (e.g. 1 second) after the function call is done before doing more TensorFlow work that reallocates GPU memory. But if I don’t wait, the worker has probably not been killed yet when I try to use the GPU again, so the memory hasn’t been released yet.
Now, waiting for an arbitrary amount of time is definitely not something one wants to do. What I want is to wait until I get some kind of confirmation that the worker has been killed.
How can I do that?
Some pseudo code:
import time
import ray

@ray.remote(num_cpus=1, num_gpus=1, max_calls=1)
def use_tf():
    # use tensorflow to do some stuff and then return some results...
    ...

# call the ray remote which uses tensorflow and get the result from it
result = ray.get(use_tf.remote())
# wait a bit for gpu memory to be released
# this is what I want to replace by some code that would wait until the worker that executed the use_tf() function is shut down and GPU memory has been released instead...
time.sleep(1)
# do some tensorflow stuff using the GPU
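
To make the goal concrete, here is a minimal sketch of one possible stand-in for the fixed sleep: poll a free-memory probe until enough GPU memory is available, with a timeout. This is not a Ray API and it does not confirm that the worker was killed; it only observes the effect. The helper name wait_for_free_gpu_memory and its parameters are hypothetical, and the probe is passed in as a callable (it could, for example, be backed by pynvml's nvmlDeviceGetMemoryInfo, which is an assumption about the setup).

```python
import time
from typing import Callable


def wait_for_free_gpu_memory(
    get_free_mib: Callable[[], float],
    required_mib: float,
    timeout_s: float = 30.0,
    poll_interval_s: float = 0.05,
) -> float:
    """Block until get_free_mib() reports at least required_mib.

    Hypothetical helper: get_free_mib is any callable returning the free
    GPU memory in MiB. Returns the seconds spent waiting, or raises
    TimeoutError if the memory is not freed within timeout_s.
    """
    start = time.monotonic()
    while True:
        if get_free_mib() >= required_mib:
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(
                f"GPU memory not released within {timeout_s} seconds"
            )
        time.sleep(poll_interval_s)
```

In the pseudocode above, a call such as wait_for_free_gpu_memory(probe, required_mib=4000) would take the place of time.sleep(1), where probe and the 4000 MiB threshold are placeholders to adapt to the actual GPU and model.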