I have some PyTorch code that uses Ray Tune. It runs fine on my laptop (on CPU), but when I run it on my computing cluster I get the following error:
2021-03-03 17:24:33,331 ERROR function_manager.py:498 -- Failed to load actor class ImplicitFunc.
(pid=6410) Traceback (most recent call last):
(pid=6410) File "/.../lib/python3.8/site-packages/ray/function_manager.py", line 496, in _load_actor_class_from_gcs
(pid=6410) actor_class = pickle.loads(pickled_class)
(pid=6410) File "/.../lib/python3.8/site-packages/torch/storage.py", line 141, in _load_from_bytes
(pid=6410) return torch.load(io.BytesIO(b))
(pid=6410) File "/.../lib/python3.8/site-packages/torch/serialization.py", line 595, in load
(pid=6410) return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
(pid=6410) File "/.../lib/python3.8/site-packages/torch/serialization.py", line 774, in _legacy_load
(pid=6410) result = unpickler.load()
(pid=6410) File "/.../lib/python3.8/site-packages/torch/serialization.py", line 730, in persistent_load
(pid=6410) deserialized_objects[root_key] = restore_location(obj, location)
(pid=6410) File "/.../lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
(pid=6410) result = fn(storage, location)
(pid=6410) File "/.../lib/python3.8/site-packages/torch/serialization.py", line 155, in _cuda_deserialize
(pid=6410) return storage_type(obj.size())
(pid=6410) File "/.../lib/python3.8/site-packages/torch/cuda/__init__.py", line 462, in _lazy_new
(pid=6410) return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
(pid=6410) RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
I’m running the following versions:
ray 1.2.0
pytorch 1.7.1
cuda 10.1
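A minimal sketch for collecting this version information directly on the compute node, so it can be pasted into a bug report. The helper names `installed_version` and `environment_report` are illustrative only and are not part of Ray or PyTorch; the sketch assumes the packages are installed under the distribution names `ray` and `torch`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("ray", "torch")):
    """Collect package versions plus basic CUDA details as seen by PyTorch."""
    report = {name: installed_version(name) for name in packages}
    try:
        import torch  # imported lazily so the report still works without PyTorch

        report["cuda_build"] = torch.version.cuda  # CUDA version PyTorch was built with
        report["cuda_available"] = torch.cuda.is_available()
        report["gpu_count"] = torch.cuda.device_count()
    except ImportError:
        pass  # PyTorch is not installed; only package versions are reported
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this once on the laptop and once inside a job on the GPU node makes it easy to compare the two environments side by side.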
I’d submit an issue on GitHub, but I’m not sure how to write a minimal example that reproduces the problem.
Additional info: I ran this code on a GPU compute node with two 2080 Ti GPUs.