How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi there,
Identified challenge
I’ve been trying to use Ray Tune in combination with TensorFlow (2.4.1) on a single GPU, but I’ve run into a problem: TensorFlow allocates all available GPU memory during the very first trial, leaving no memory for additional trials to run in parallel. This is undesirable behavior.
Solution to identified challenge
TensorFlow allocating all available GPU memory is well-known behavior in the TensorFlow community and can be mitigated by enabling memory growth (for details, see https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth).
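For reference, enabling memory growth roughly looks like this (the standard pattern from the TensorFlow docs; it has to run before TensorFlow initializes the GPUs):

```python
import tensorflow as tf

# Must run before any GPU has been initialized by TensorFlow.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```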
Additional challenge introduced by solution
I found that enabling memory growth in the main entry point of the Python script that launches the Ray Tune experiment is insufficient, at least when using Ray Tune’s Trainable (Class) API. The reason appears to be that each trial starts a fresh TensorFlow instance in its own worker process, which does not inherit the memory growth setting I enabled in the driver.
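To illustrate what I mean (as far as I understand it): each trial’s setup runs in its own Ray worker process, not in the driver process where I enabled memory growth. A minimal sketch with a dummy Trainable (the class and its contents are just placeholders for illustration):

```python
import os
import ray
from ray import tune

class PidTrainable(tune.Trainable):
    def setup(self, config):
        # Runs inside the trial's own worker process, not the driver,
        # so TensorFlow settings made in the driver don't apply here.
        print(f"setup() running in process {os.getpid()}")

    def step(self):
        # Report done immediately; this dummy doesn't train anything.
        return {"done": True}

ray.init()
tune.run(PidTrainable, num_samples=2)
print(f"driver process: {os.getpid()}")
```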
Solution to additional challenge
Hence, to prevent TensorFlow from allocating all available memory for each new trial, I managed to enable memory growth by (1) importing TensorFlow and (2) enabling memory growth inside the setup method of the Trainable class. Aside from this not being a particularly clean solution, my Trainable class also has other methods that require TensorFlow. Because I only import TensorFlow inside setup (and not in the other methods that need it), those methods fail with name errors (NameError), since the tf module is only bound in setup’s local scope. I can work around this by importing TensorFlow in every method, but that feels quite dirty.
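For concreteness, here is a stripped-down sketch of my current workaround (the class name, model, and fractional GPU request are just placeholders; the real training logic is omitted):

```python
from ray import tune

class MyTrainable(tune.Trainable):
    def setup(self, config):
        # Import TensorFlow and enable memory growth inside setup(), so it
        # happens in the trial's own worker process before the GPU is used.
        import tensorflow as tf
        for gpu in tf.config.list_physical_devices("GPU"):
            tf.config.experimental.set_memory_growth(gpu, True)
        self.model = tf.keras.Sequential([tf.keras.layers.Dense(1)])

    def step(self):
        # `tf` was only imported in setup(), so it is not in scope here;
        # using it directly raises a NameError unless I import it again.
        import tensorflow as tf
        x = tf.zeros((8, 4))
        y = self.model(x)
        return {"loss": float(tf.reduce_mean(y))}

# Fractional GPU per trial so several trials can share the single GPU.
tune.run(
    MyTrainable,
    num_samples=4,
    resources_per_trial={"gpu": 0.25},
    stop={"training_iteration": 10},
)
```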
Given how widely TensorFlow is used, I figured there should be a cleaner solution to this problem. However, I was unable to find any mention of it in the documentation, and I found no concrete solutions in related questions on GitHub/Discuss.
Any form of help resolving this memory allocation issue with TensorFlow is greatly appreciated. Thank you in advance.