CUDA error: all CUDA-capable devices are busy or unavailable

I have some PyTorch code incorporating Ray Tune which runs fine on my laptop (on CPU), but when I try to run it on my computing cluster I get the following error:

2021-03-03 17:24:33,331	ERROR function_manager.py:498 -- Failed to load actor class ImplicitFunc.
(pid=6410) Traceback (most recent call last):
(pid=6410)   File "/.../lib/python3.8/site-packages/ray/function_manager.py", line 496, in _load_actor_class_from_gcs
(pid=6410)     actor_class = pickle.loads(pickled_class)
(pid=6410)   File "/.../lib/python3.8/site-packages/torch/storage.py", line 141, in _load_from_bytes
(pid=6410)     return torch.load(io.BytesIO(b))
(pid=6410)   File "/.../lib/python3.8/site-packages/torch/serialization.py", line 595, in load
(pid=6410)     return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
(pid=6410)   File "/.../lib/python3.8/site-packages/torch/serialization.py", line 774, in _legacy_load
(pid=6410)     result = unpickler.load()
(pid=6410)   File "/.../lib/python3.8/site-packages/torch/serialization.py", line 730, in persistent_load
(pid=6410)     deserialized_objects[root_key] = restore_location(obj, location)
(pid=6410)   File "/.../lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
(pid=6410)     result = fn(storage, location)
(pid=6410)   File "/.../lib/python3.8/site-packages/torch/serialization.py", line 155, in _cuda_deserialize
(pid=6410)     return storage_type(obj.size())
(pid=6410)   File "/.../lib/python3.8/site-packages/torch/cuda/__init__.py", line 462, in _lazy_new
(pid=6410)     return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
(pid=6410) RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

I’m running the following versions:

ray 1.2.0
pytorch 1.7.1
cuda 10.1

I’d submit an issue on GitHub but I’m not sure how to write code which would reproduce the problem.

Additional info: I ran this code on a GPU compute node with two RTX 2080 Ti GPUs.

Update: It looks like the problem was that I was referencing models defined outside my training function. Ray pickles everything the function closes over, and the worker processes then fail when they try to deserialize those CUDA tensors (which matches the pickle.loads call in the traceback). Is there a way to share one pretrained model across multiple trials, for example if you were doing transfer learning?

Yeah, you could do something like ref = ray.put(model), and then in your training function do model = ray.get(ref).

At the moment there doesn't seem to be a memory bottleneck from storing multiple copies of the model, but I'll keep this in mind in case it becomes an issue!