CUDA-capable device(s) is/are busy or unavailable

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi,
I am deploying a Ray cluster with Kubernetes; the pods run Docker images based on rayproject/ray:2.2.0-py310-gpu.

I am running the training example provided here: torch_fashion_mnist_example — Ray 2.2.0
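For reference, here is a condensed sketch of what the job script does, adapted from that example. The model, config values, and file layout are simplified placeholders rather than my exact ray_example_job.py; the important parts are the train.torch.prepare_data_loader() call and the GPU-enabled ScalingConfig.

# Condensed sketch of the training job (adapted from torch_fashion_mnist_example;
# not the exact script from the traceback below).
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

from ray import train
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func(config: dict):
    training_data = datasets.FashionMNIST(
        root="~/data", train=True, download=True, transform=ToTensor()
    )
    train_dataloader = DataLoader(training_data, batch_size=config["batch_size"])

    # This is the call that fails on the cluster: it creates a CUDA stream
    # on the worker's assigned GPU to overlap host-to-device transfers.
    train_dataloader = train.torch.prepare_data_loader(train_dataloader)

    # Placeholder model; the real example uses a small multi-layer network.
    model = train.torch.prepare_model(
        nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    )

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    for _ in range(config["epochs"]):
        for X, y in train_dataloader:
            optimizer.zero_grad()
            loss_fn(model(X), y).backward()
            optimizer.step()


trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 1},
    # use_gpu=True makes Ray request one GPU per training worker.
    scaling_config=ScalingConfig(num_workers=1, use_gpu=True),
)
result = trainer.fit()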

If I run the training code inside the Docker image on my local computer (using Ray with one worker), it works as intended, but on the cluster I instead receive the following error:

ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=191, ip=10.244.14.4, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 367, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=239, ip=10.244.14.4, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fdad589af50>)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/tmp/ray/session_2023-01-30_05-16-35_701765_13/runtime_resources/working_dir_files/_ray_pkg_dbbcfa66c4736b93/ray_launcher/ray_example_job.py", line 102, in train_func
    train_dataloader = train.torch.prepare_data_loader(train_dataloader)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/torch/train_loop_utils.py", line 131, in prepare_data_loader
    return get_accelerator(_TorchAccelerator).prepare_data_loader(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/torch/train_loop_utils.py", line 444, in prepare_data_loader
    data_loader = _WrappedDataLoader(data_loader, device, auto_transfer)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/torch/train_loop_utils.py", line 556, in __init__
    torch.cuda.Stream(device)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/cuda/streams.py", line 37, in __new__
    return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

What could cause the worker to be unable to access the GPU, or why would the GPU be reported as busy, given that no other job is running on the cluster?
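For what it's worth, a minimal check along these lines (just a sketch, not part of my job; the num_gpus=1 request mirrors what TorchTrainer makes per training worker) should show whether a Ray task on the cluster can initialize a CUDA stream at all:

# Minimal GPU sanity check (sketch): does a Ray task with one GPU assigned
# see a device and manage to create a CUDA stream?
import os

import ray
import torch


@ray.remote(num_gpus=1)  # same per-worker GPU request that TorchTrainer makes
def cuda_check():
    # Ray sets CUDA_VISIBLE_DEVICES to the GPU index(es) it assigned this task.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    available = torch.cuda.is_available()
    try:
        # Reproduces the failing call from the traceback.
        torch.cuda.Stream(torch.device("cuda:0"))
        stream_result = "ok"
    except RuntimeError as exc:
        stream_result = str(exc)
    return visible, available, stream_result


ray.init(address="auto")  # attach to the running cluster (drop the address to test locally)
print(ray.get(cuda_check.remote()))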

Edit:
The same error does not occur with the Docker image rayproject/ray:2.0.0-py39-gpu.

This seems to be a problem with the Ray 2.2.0-py310 Docker images.
I also downloaded the rayproject/ray-ml:2.2.0-gpu Docker image and ran the example training code linked above with it, which resulted in the same error.
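Since the image version appears to be the only variable, a quick dump of the CUDA stack inside each container might make the comparison concrete. Something like this sketch, run in both rayproject/ray:2.0.0-py39-gpu and rayproject/ray:2.2.0-py310-gpu:

# Sketch: print the Ray/PyTorch/CUDA versions an image ships with, plus the
# driver the node exposes, to compare the working and failing images.
import subprocess

import ray
import torch

print("ray:", ray.__version__)
print("torch:", torch.__version__)
print("torch built against CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)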