TensorFlow allocates all available GPU memory in the first trial, leaving no space to run additional trials in parallel

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi there,

Identified challenge
I’ve been trying to use Ray Tune in combination with TensorFlow (2.4.1) on a single GPU, but I’ve run into a challenge: during the first trial, TensorFlow allocates all available memory on the GPU, leaving no space to run additional trials in parallel. This is undesirable behavior.

Solution to identified challenge
TensorFlow allocating all available GPU memory is well-known default behavior that can be mitigated by enabling memory growth (for details, see https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth).
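For reference, enabling memory growth in plain TensorFlow takes only a few lines. Below is a minimal sketch based on the APIs from the linked docs; the try/except and the returned count are my additions so the snippet also runs on machines without TensorFlow or without a GPU:

```python
def enable_memory_growth():
    """Enable memory growth on every visible GPU.

    Returns the number of GPUs configured (0 if TensorFlow is not
    installed or no GPU is visible). Must be called before any GPU
    tensors are created, otherwise TensorFlow raises a RuntimeError.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return 0  # TensorFlow not installed; nothing to configure
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)

print(enable_memory_growth())
```

With memory growth enabled, TensorFlow starts with a small allocation and grows it as needed instead of grabbing the whole device up front.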

Additional challenge introduced by solution
I found that enabling memory growth in the main block of the Python script that launches the Ray Tune experiment is insufficient, at least when using the Trainable Class API in Ray Tune. The reason seems to be that each trial runs in its own worker process, which starts a fresh TensorFlow instance and thereby discards the memory growth setting I enabled in the driver.

Solution to additional challenge
Hence, in order to prevent TensorFlow from allocating all available memory for each new trial, I managed to enable memory growth by (1) importing TensorFlow and (2) enabling memory growth in the setup method of the Trainable class. Aside from this being a not-so-clean solution, I also have other methods in my Trainable class that require TensorFlow. Because I import TensorFlow only in the setup method (and not in the other methods that also require it), I run into NameErrors. It’s possible to resolve this by importing TensorFlow in each method, but this feels quite dirty.
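The NameError is plain Python scoping, independent of Ray or TensorFlow: a module imported inside one method is bound to a local name and is not visible in other methods. A minimal illustration, using the standard-library json module as a stand-in for TensorFlow:

```python
class Trainable:
    def setup(self):
        import json  # bound to a name local to setup() only
        self.config_str = json.dumps({"lr": 0.01})

    def step(self):
        # 'json' is not defined here: the import in setup() was local to it
        try:
            return json.loads(self.config_str)
        except NameError:
            return None

t = Trainable()
t.setup()
print(t.step())  # None: step() cannot see the import made in setup()
```

This is why importing TensorFlow at module level (while still configuring memory growth inside setup) is the cleaner split.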

Given how widely TensorFlow is used, I figured there should be a cleaner solution to this problem. However, I was unable to find any mention of it in the documentation, nor any concrete solutions in other questions on GitHub/Discuss.

Any form of help resolving this memory allocation issue with TensorFlow is greatly appreciated. Thank you in advance.

I found that the cleanest way to prevent TensorFlow from allocating the entire GPU memory is to (1) import TensorFlow globally (i.e., at the top of the Python file) and (2) enable memory growth once in the setup method of the Trainable class. See below for a minimal working example.

# Note: TF can be imported globally instead of in all methods
import tensorflow as tf
from ray import tune


class ExampleTrainable(tune.Trainable):
    def _load_data(self):
        (x_train, y_train), (x_test, y_test) = ...
        return (x_train, y_train), (x_test, y_test)

    def _build_model(self):
        # Note: TF is now available in all methods w/o NameErrors
        model = ...
        return model

    def setup(self, config):
        # Note: crucial to keep TF from allocating the entire GPU memory
        gpus = tf.config.experimental.list_physical_devices("GPU")
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)

        self.train_data, self.test_data = self._load_data()
        self.model = self._build_model()
        self.model.compile(...)

    def step(self):
        # Fit the model for one epoch and return the metrics
        history = self.model.fit(*self.train_data, epochs=1, verbose=0)
        return {"loss": history.history["loss"][-1]}


if __name__ == "__main__":
    tuner = tune.Tuner(
        tune.with_resources(
            ExampleTrainable,
            resources={"cpu": 1, "gpu": 1},
        ),
    )
    results = tuner.fit()

Thanks @Cysto, this is very helpful for other users.

We may want to confirm the problem and solution, and add a utility to do this automatically - cc @amogkam