CUDA error: all CUDA-capable devices are busy or unavailable

I have some PyTorch code using Ray Tune that runs fine on my laptop (on CPU), but when I try to run it on my computing cluster I get the following error:

2021-03-03 17:24:33,331	ERROR function_manager.py:498 -- Failed to load actor class ImplicitFunc.
(pid=6410) Traceback (most recent call last):
(pid=6410)   File "/.../lib/python3.8/site-packages/ray/function_manager.py", line 496, in _load_actor_class_from_gcs
(pid=6410)     actor_class = pickle.loads(pickled_class)
(pid=6410)   File "/.../lib/python3.8/site-packages/torch/storage.py", line 141, in _load_from_bytes
(pid=6410)     return torch.load(io.BytesIO(b))
(pid=6410)   File "/.../lib/python3.8/site-packages/torch/serialization.py", line 595, in load
(pid=6410)     return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
(pid=6410)   File "/.../lib/python3.8/site-packages/torch/serialization.py", line 774, in _legacy_load
(pid=6410)     result = unpickler.load()
(pid=6410)   File "/.../lib/python3.8/site-packages/torch/serialization.py", line 730, in persistent_load
(pid=6410)     deserialized_objects[root_key] = restore_location(obj, location)
(pid=6410)   File "/.../lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
(pid=6410)     result = fn(storage, location)
(pid=6410)   File "/.../lib/python3.8/site-packages/torch/serialization.py", line 155, in _cuda_deserialize
(pid=6410)     return storage_type(obj.size())
(pid=6410)   File "/.../lib/python3.8/site-packages/torch/cuda/__init__.py", line 462, in _lazy_new
(pid=6410)     return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
(pid=6410) RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

I’m running the following versions:

Ray 1.2.0
PyTorch 1.7.1
CUDA 10.1

I’d submit an issue on GitHub, but I’m not sure how to write code that would reproduce the problem.

Additional info: I ran this code on a GPU compute node with two RTX 2080 Tis.

Update: It looks like the problem was that I had models defined outside of my training function that I was referencing inside it. I guess Ray has to pickle the training function along with everything it closes over, so the CUDA-resident models fail to deserialize in the worker (which matches the traceback above). Is there a way to, for example, have one pretrained model that you reference from multiple trials? Like if you were doing transfer learning?
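For context, the problematic pattern in my case looked roughly like this (the model and function names are made up for illustration): a model built at module level on the GPU and then referenced inside the function passed to tune.run.

import torch
from ray import tune

# Model created outside the training function, already on the GPU.
pretrained = torch.nn.Linear(10, 2).cuda()

def train_fn(config):
    # Referencing `pretrained` here makes Ray serialize the CUDA model
    # together with train_fn, which is what fails in the worker process.
    out = pretrained(torch.randn(1, 10).cuda())
    tune.report(loss=out.sum().item())

tune.run(train_fn, resources_per_trial={"gpu": 1})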

Yeah, you could hypothetically do something like ref = ray.put(model), and then in your training function do model = ray.get(ref).
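Roughly like this (the model, shapes, and resource request are just placeholders; I'm assuming you keep the model on CPU before ray.put and move it to the GPU inside each trial):

import ray
import torch
from ray import tune

ray.init()

# Put the pretrained model into the object store once, on the driver.
# Keeping it on CPU here avoids the CUDA deserialization error above.
pretrained = torch.nn.Linear(10, 2)
model_ref = ray.put(pretrained)

def train_fn(config):
    # Each trial fetches the shared model and moves it to its own GPU.
    model = ray.get(model_ref).cuda()
    out = model(torch.randn(1, 10).cuda())
    tune.report(loss=out.sum().item())

tune.run(train_fn, resources_per_trial={"gpu": 1})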

At the moment it seems like there’s no memory bottleneck in storing multiple copies of the models, but I’ll keep this in mind in case it becomes an issue!

I’m facing a similar problem now. I’ve shared my Trainable here.

@import-antigravity, what do you mean by “I had models outside my training function”? And how did you find those problematic definitions?