Status: all CUDA-capable devices are busy or unavailable

I verified that my models are defined inside the Trainable (as suggested here), but I still see this error when I run it via tune.run.

I'm using Ray 1.8, TF 2.4, and CUDA 11.1


exp_path = Path("./").resolve()
train_files, test_files = transform_data(raw_csv_files)  # preprocesses the raw CSVs into pandas DataFrames without any TF code

class MyModel(Trainable):
    def _dense_model(self):
        self.model_dir = Path(self.logdir)/"est_model_dir"
        dnn_est = tf.estimator.DNNClassifier(
            ...
            optimizer=lambda: tf.keras.optimizers.Adam(
                learning_rate=self.config["lr"]
            )
        )
        logging.warning("Saving model data to %s" % self.model_dir)
        return dnn_est
    
    def get_data(self):
        input_fn = partial(
            tf.compat.v1.data.experimental.make_csv_dataset,
            batch_size=self.config["batch_size"],
            ...
        )
        self.train_input_fn = lambda: input_fn(
            file_pattern=train_files)
        self.test_input_fn = lambda: input_fn(
            file_pattern=test_files)

    def setup(self, ray_config):
        # Ray sets current directory to trials directory for every trial.
        # Ray dev suggested this change https://groups.google.com/d/msg/ray-dev/T5q7DkkGlpQ/ZzbOTIsVBwAJ
        os.chdir(exp_path)  # to make sure all files saved by me are relative to this path
        self.model = self._dense_model()
        self.get_data()

    def step(self):
        self.model.train(self.train_input_fn, steps=self.config["epochs"] * self.config["steps_per_epoch"])
        val_metrics = self.model.evaluate(self.test_input_fn, steps=self.config["steps_per_epoch"], name="val")
        train_metrics = ...
        return {**train_metrics, **val_metrics}

    def save_checkpoint(self, checkpoint_dir):
        return self.export_model(["py"]) # so I have the final model after all training

    def _export_model(self, export_formats, export_dir):
        return self.model.save(export_dir)
    

Running the Trainable directly, outside of tune.run, works on GPUs perfectly:

t = MyModel({
    ... # some config
})

t.train()

But triggering it this way fails with Status: all CUDA-capable devices are busy or unavailable:

tune.run(
    MyModel,
    name=exp_name,
    local_dir="./ray_results/",
    resources_per_trial={'gpu': 1},
    config={
        ...
    },
    stop={"training_iteration": ...},
    max_failures=0,
    num_samples=...,
    checkpoint_at_end=True, # so the final model is exported
    callbacks=[
        MLflowLoggerCallback(
            experiment_name=exp_name,
            save_artifact=True
        )
    ],
)
  1. What am I missing in the way I’m using Trainable?
  2. How do I systematically detect any leaks (i.e. TF code that runs outside the Trainable)? One detection idea is sketched below.
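
For context, the kind of check I have in mind is something like the sketch below: before calling tune.run, ask nvidia-smi which PIDs already hold a CUDA context, and flag the driver/notebook process if it shows up there (the helper name is mine, and it assumes nvidia-smi is on the PATH).

import os
import subprocess

def processes_holding_gpu():
    # PIDs that currently hold a CUDA context, according to nvidia-smi.
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(pid) for pid in out.split()]

# If this (the driver/notebook) process shows up, some TF code outside the
# Trainable has already grabbed a CUDA context before tune.run even starts.
if os.getpid() in processes_holding_gpu():
    print("Leak: the driver process already holds a CUDA context")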

Thank you for helping me :slight_smile:

Nice!

Maybe try using Training (tune.Trainable, tune.report) — Ray v1.10.0 ?

Thanks for taking a look, @rliaw

  1. Why is a conditional wait_for_gpu not part of Trainable.train() by default? What was the intuition behind this? I hit the error immediately, not partway through the experiment.
  2. Do you see any leaks with my usage of Trainable?

Can you check whether it works if you remove the conditional?

Unfortunately I didn't have much time to run your exact code.

So concretely, can you report back your results after you try calling tune.utils.wait_for_gpu() in def step()?
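
Something like the sketch below, right at the top of step() so the wait happens before the Estimator touches the GPU (wait_for_gpu needs the gputil package installed, as far as I remember):

from ray import tune

class MyModel(Trainable):
    ...

    def step(self):
        # Block until the reported GPU util drops to target_util, so a
        # previous trial's process has fully released the device.
        tune.utils.wait_for_gpu(target_util=0.01)  # 0.01 is the default
        self.model.train(
            self.train_input_fn,
            steps=self.config["epochs"] * self.config["steps_per_epoch"],
        )
        ...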

Also, perhaps you could consider explicitly setting tune.run(reuse_actors=False)
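
i.e. something like this sketch (other arguments elided):

tune.run(
    MyModel,
    resources_per_trial={"gpu": 1},
    # Start a fresh actor process for every trial so no TF/CUDA state is
    # carried over from a previous trial.
    reuse_actors=False,
    config={
        ...
    },
)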

== Status ==
Current time: 2022-02-13 06:07:29 (running for 00:00:50.97)
Memory usage on this node: 19.9/xxx GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/8 CPUs, 1.0/2 GPUs, 0.0/43.38 GiB heap, 0.0/9.31 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /home/jovyan/...
Number of trials: 1/1 (1 RUNNING)

(pid=550) 2022-02-13 06:07:32,236	INFO util.py:462 -- Waiting for GPU util to reach 0.01. Util: 0.014

It failed after some time, so I changed the call in my step() to tune.utils.wait_for_gpu(target_util=0.02). It then got past the wait, but failed with the same error I opened this topic with.

(pid=347) 2022-02-13 06:21:54.914399: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
(pid=347) 2022-02-13 06:21:54.915827: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
(pid=347) 2022-02-13 06:21:54.933441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
(pid=347) pciBusID: 0000:da:00.0 name: Tesla V100-PCIE-32GB computeCapability: 7.0
(pid=347) coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
(pid=347) 2022-02-13 06:21:54.933483: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
(pid=347) 2022-02-13 06:21:54.937109: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
(pid=347) 2022-02-13 06:21:54.937187: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
(pid=347) 2022-02-13 06:21:54.938507: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
(pid=347) 2022-02-13 06:21:54.938785: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
(pid=347) 2022-02-13 06:21:54.942326: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
(pid=347) 2022-02-13 06:21:54.943053: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
(pid=347) 2022-02-13 06:21:54.943262: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
(pid=347) 2022-02-13 06:21:54.945908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-02-13 06:21:57,117	ERROR trial_runner.py:924 -- Trial MyModel_3b81c_00000: Error processing event.
Traceback (most recent call last):
  File "/home/jovyan/.local/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 890, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/jovyan/.local/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 788, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/jovyan/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/jovyan/.local/lib/python3.7/site-packages/ray/worker.py", line 1625, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(InternalError): ray::MyModel.train() (pid=347, ip=100.96.202.248, repr=<__main__.MyModel object at 0x7f229838e690>)
  File "/home/jovyan/.local/lib/python3.7/site-packages/ray/tune/trainable.py", line 283, in train
    result = self.step()
  File "<ipython-input-26-cc6f45585ccc>", line 92, in step
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1208, in _train_model_default
    saving_listeners)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1510, in _train_with_estimator_spec
    save_graph_def=self._config.checkpoint_save_graph_def) as mon_sess:
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 604, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1038, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 749, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1231, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1236, in _create_session
    return self._sess_creator.create_session()
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 902, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 669, in create_session
    init_fn=self._scaffold.init_fn)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/session_manager.py", line 295, in prepare_session
    config=config)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/training/session_manager.py", line 199, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1596, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/jovyan/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 711, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable

I also verified that TF is able to see the GPUs using

tf.config.list_physical_devices('GPU')

And it is able to see all the GPUs.

And I made sure there isn't any other process locking the GPU. Do I have to import Ray and TF in a specific order or something?
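
For reference, here is a minimal check I plan to drop into setup() to log what the trial actor actually sees (assuming Ray exports CUDA_VISIBLE_DEVICES for the actor based on resources_per_trial):

import logging
import os

import ray
import tensorflow as tf

class MyModel(Trainable):
    ...

    def setup(self, ray_config):
        # GPU IDs Ray assigned to this trial actor.
        logging.warning("ray.get_gpu_ids(): %s", ray.get_gpu_ids())
        # Ray sets this per actor, so TF should only see the assigned device.
        logging.warning("CUDA_VISIBLE_DEVICES=%s", os.environ.get("CUDA_VISIBLE_DEVICES"))
        # What TF itself can see from inside the trial process.
        logging.warning("TF GPUs: %s", tf.config.list_physical_devices("GPU"))
        ...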

Can you try this workaround (by putting it into the setup of your Trainable)?

def setup(self, config):
    os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = "1"
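
The same thing can also be requested through the TF API inside setup(), before anything initializes the GPU. A sketch (I haven't verified it in your exact setup):

import os
import tensorflow as tf

class MyModel(Trainable):
    ...

    def setup(self, ray_config):
        os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = "1"
        # Equivalent request through the TF API: allocate GPU memory on demand
        # instead of reserving it all up front. Must run before the first GPU op.
        for gpu in tf.config.list_physical_devices("GPU"):
            tf.config.experimental.set_memory_growth(gpu, True)
        ...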

No luck! Same error :frowning:
Since I'm able to train on GPUs outside of ray.tune, I think it is some setting on tune's side. To iteratively explore and debug any other ideas, we could set up a live debug session. I'm free after 12pm PT for the rest of the week.