Status: all CUDA-capable devices are busy or unavailable

I verified that my models are defined inside the Trainable (as suggested here), but I still see this error when I run it via tune.run.

I'm using Ray 1.8, TF 2.4, and CUDA 11.1


exp_path = Path("./").resolve()
train_files, test_files = transform_data(raw_csv_files)  # preprocesses the raw CSVs into pandas DataFrames without any TF code

class MyModel(Trainable):
    def _dense_model(self):
        self.model_dir = Path(self.logdir)/"est_model_dir"
        dnn_est = tf.estimator.DNNClassifier(
            ...
            optimizer=lambda: tf.keras.optimizers.Adam(
                learning_rate=self.config["lr"]
            )
        )
        logging.warning("Saving model data to %s" % self.model_dir)
        return dnn_est
    
    def get_data(self):
        input_fn = partial(
            tf.compat.v1.data.experimental.make_csv_dataset,
            batch_size=self.config["batch_size"],
            ...
        )
        self.train_input_fn = lambda: input_fn(
            file_pattern=train_files)
        self.test_input_fn = lambda: input_fn(
            file_pattern=test_files)

    def setup(self, ray_config):
        # Ray sets current directory to trials directory for every trial.
        # Ray dev suggested this change https://groups.google.com/d/msg/ray-dev/T5q7DkkGlpQ/ZzbOTIsVBwAJ
        os.chdir(exp_path)  # to make sure all files saved by me are relative to this path
        self.model = self._dense_model()
        self.get_data()

    def step(self):
        self.model.train(self.train_input_fn, steps=self.config["epochs"] * self.config["steps_per_epoch"])
        val_metrics = self.model.evaluate(self.test_input_fn, steps=self.config["steps_per_epoch"], name="val")
        train_metrics = ...
        return {**train_metrics, **val_metrics}

    def save_checkpoint(self, checkpoint_dir):
        return self.export_model(["py"]) # so I have the final model after all training

    def _export_model(self, export_formats, export_dir):
        return self.model.save(export_dir)
    

Running the Trainable directly, outside of tune.run, works on GPUs perfectly:

t = MyModel({
    ... # some config
})

t.train()

But triggering it this way fails with Status: all CUDA-capable devices are busy or unavailable:

tune.run(
    MyModel,
    name=exp_name,
    local_dir="./ray_results/",
    resources_per_trial={'gpu': 1},
    config={
        ...
    },
    stop={"training_iteration": ...},
    max_failures=0,
    num_samples=...,
    checkpoint_at_end=True, # so the final model is exported
    callbacks=[
        MLflowLoggerCallback(
            experiment_name=exp_name,
            save_artifact=True
        )
    ],
)
  1. What am I missing in the way I’m using Trainable?
  2. How do I systematically detect any leaks (i.e. TF code that runs outside the Trainable)? One detection idea is sketched below.
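
For context, the kind of check I have in mind is something like the sketch below: before calling tune.run, ask nvidia-smi which PIDs already hold a CUDA context, and flag the driver/notebook process if it shows up there (the helper name is mine, and it assumes nvidia-smi is on the PATH).

import os
import subprocess

def processes_holding_gpu():
    # PIDs that currently hold a CUDA context, according to nvidia-smi.
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(pid) for pid in out.split()]

# If this (the driver/notebook) process shows up, some TF code outside the
# Trainable has already grabbed a CUDA context before tune.run even starts.
if os.getpid() in processes_holding_gpu():
    print("Leak: the driver process already holds a CUDA context")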

Thank you for helping me :slight_smile:

Nice!

Maybe try using Training (tune.Trainable, tune.report) — Ray v1.10.0 ?

Thanks for taking a look, @rliaw

  1. Why is a conditional wait_for_gpu not part of Trainable.train() by default? What was the intuition behind this? I hit the error immediately, not partway through the experiment.
  2. Do you see any leaks with my usage of Trainable?

Can you check whether it works if you remove the conditional?

Unfortunately I didn't have much time to run your exact code.

So concretely, can you report back your results after you try calling tune.utils.wait_for_gpu() in def step()?
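
Something like the sketch below, right at the top of step() so the wait happens before the Estimator touches the GPU (wait_for_gpu needs the gputil package installed, as far as I remember):

from ray import tune

class MyModel(Trainable):
    ...

    def step(self):
        # Block until the reported GPU util drops to target_util, so a
        # previous trial's process has fully released the device.
        tune.utils.wait_for_gpu(target_util=0.01)  # 0.01 is the default
        self.model.train(
            self.train_input_fn,
            steps=self.config["epochs"] * self.config["steps_per_epoch"],
        )
        ...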

Also, perhaps you could consider explicitly setting tune.run(reuse_actors=False)
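
i.e. something like this sketch (other arguments elided):

tune.run(
    MyModel,
    resources_per_trial={"gpu": 1},
    # Start a fresh actor process for every trial so no TF/CUDA state is
    # carried over from a previous trial.
    reuse_actors=False,
    config={
        ...
    },
)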

== Status ==
Current time: 2022-02-13 06:07:29 (running for 00:00:50.97)
Memory usage on this node: 19.9/xxx GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/8 CPUs, 1.0/2 GPUs, 0.0/43.38 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /home/jovyan/...
Number of trials: 1/1 (1 RUNNING)

(pid=550) 2022-02-13 06:07:32,236	INFO util.py:462 -- Waiting for GPU util to reach 0.01. Util: 0.014

It failed after some time, so I changed the call in my step() to tune.utils.wait_for_gpu(target_util=0.02). It then got past the wait, but failed with the same error I opened this topic with.

(pid=347) 2022-02-13 06:21:54.914399: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
(pid=347) 2022-02-13 06:21:54.915827: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
(pid=347) 2022-02-13 06:21:54.933441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
(pid=347) pciBusID: 0000:da:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
(pid=347) coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
(pid=347) 2022-02-13 06:21:54.933483: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
(pid=347) 2022-02-13 06:21:54.937109: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
(pid=347) 2022-02-13 06:21:54.937187: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
(pid=347) 2022-02-13 06:21:54.938507: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
(pid=347) 2022-02-13 06:21:54.938785: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
(pid=347) 2022-02-13 06:21:54.942326: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
(pid=347) 2022-02-13 06:21:54.943053: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
(pid=347) 2022-02-13 06:21:54.943262: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
(pid=347) 2022-02-13 06:21:54.945908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-02-13 06:21:57,117	ERROR trial_runner.py:924 -- Trial MyModel_3b81c_00000: Error processing event.
Traceback (most recent call last):
  File "/home/jovyan/.local/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 890, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/jovyan/.local/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 788, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/jovyan/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/jovyan/.local/lib/python3.7/site-packages/ray/worker.py", line 1625, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(InternalError): ray::MyModel.train() (pid=347, ip=100.96.202.248, repr=<__main__.MyModel object at 0x7f229838e690>)
  File "/home/jovyan/.local/lib/python3.7/site-packages/ray/tune/trainable.py", line 283, in train
    result = self.step()
  File "<ipython-input-26-cc6f45585ccc>", line 92, in step
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1208, in _train_model_default
    saving_listeners)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1510, in _train_with_estimator_spec
    save_graph_def=self._config.checkpoint_save_graph_def) as mon_sess:
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 604, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1038, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 749, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1231, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1236, in _create_session
    return self._sess_creator.create_session()
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 902, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 669, in create_session
    init_fn=self._scaffold.init_fn)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/session_manager.py", line 295, in prepare_session
    config=config)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/session_manager.py", line 199, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1596, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 711, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable

I also verified that TF is able to see the GPUs using

tf.config.list_physical_devices('GPU')

And it is able to see all the GPUs.

And I made sure there isn't any other process locking the GPU. Do I have to import Ray and TF in a specific order or something?
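
For reference, here is a minimal check I plan to drop into setup() to log what the trial actor actually sees (assuming Ray exports CUDA_VISIBLE_DEVICES for the actor based on resources_per_trial):

import logging
import os

import ray
import tensorflow as tf

class MyModel(Trainable):
    ...

    def setup(self, ray_config):
        # GPU IDs Ray assigned to this trial actor.
        logging.warning("ray.get_gpu_ids(): %s", ray.get_gpu_ids())
        # Ray sets this per actor, so TF should only see the assigned device.
        logging.warning("CUDA_VISIBLE_DEVICES=%s", os.environ.get("CUDA_VISIBLE_DEVICES"))
        # What TF itself can see from inside the trial process.
        logging.warning("TF GPUs: %s", tf.config.list_physical_devices("GPU"))
        ...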

Can you try this workaround (by putting it into the setup of your Trainable)?

def setup(self, config):
    os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = "1"
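
The same thing can also be requested through the TF API inside setup(), before anything initializes the GPU. A sketch (I haven't verified it in your exact setup):

import os
import tensorflow as tf

class MyModel(Trainable):
    ...

    def setup(self, ray_config):
        os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = "1"
        # Equivalent request through the TF API: allocate GPU memory on demand
        # instead of reserving it all up front. Must run before the first GPU op.
        for gpu in tf.config.list_physical_devices("GPU"):
            tf.config.experimental.set_memory_growth(gpu, True)
        ...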

No luck! Same error :frowning:
Since I'm able to train on GPUs outside of ray.tune, I think it is some setting on tune's side. To iteratively explore and debug any other ideas, we could set up a live debug session. I'm free after 12pm PT for the rest of the week.