How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi,
I am deploying a Ray cluster on Kubernetes; the pods run Docker images based on rayproject/ray:2.2.0-py310-gpu.
I am running the training example provided here: torch_fashion_mnist_example — Ray 2.2.0
If I run the training code from the Docker image on my local computer (using Ray with one worker), it works as intended, but on the cluster I instead receive this error:
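For context, the relevant parts of my training function and trainer setup look roughly like this (a trimmed sketch of the linked Fashion-MNIST example; the model definition and training loop are as in the docs, and num_workers=1 is what I use locally):

import torch
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

from ray import train
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig


def train_func(config):
    batch_size = config.get("batch_size", 64)

    training_data = datasets.FashionMNIST(
        root="~/data", train=True, download=True, transform=ToTensor()
    )
    train_dataloader = DataLoader(training_data, batch_size=batch_size)

    # This is the call that raises the CUDA error on the cluster
    # (line 102 in my ray_example_job.py): it wraps the loader and
    # moves batches to the GPU assigned to this worker.
    train_dataloader = train.torch.prepare_data_loader(train_dataloader)

    # ... model setup, train.torch.prepare_model(...), and the training
    # loop follow the documented example ...


trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={"batch_size": 64},
    scaling_config=ScalingConfig(num_workers=1, use_gpu=True),
)
result = trainer.fit()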
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=191, ip=10.244.14.4, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 367, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=239, ip=10.244.14.4, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fdad589af50>)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/tmp/ray/session_2023-01-30_05-16-35_701765_13/runtime_resources/working_dir_files/_ray_pkg_dbbcfa66c4736b93/ray_launcher/ray_example_job.py", line 102, in train_func
    train_dataloader = train.torch.prepare_data_loader(train_dataloader)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/torch/train_loop_utils.py", line 131, in prepare_data_loader
    return get_accelerator(_TorchAccelerator).prepare_data_loader(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/torch/train_loop_utils.py", line 444, in prepare_data_loader
    data_loader = _WrappedDataLoader(data_loader, device, auto_transfer)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/torch/train_loop_utils.py", line 556, in __init__
    torch.cuda.Stream(device)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/cuda/streams.py", line 37, in __new__
    return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
What could be the cause of the worker not being able to access the GPU, and why would the GPU be reported as busy when there is no other job running on the cluster?
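In case it helps narrow things down, this is the kind of check I could drop in at the top of train_func so it runs on the worker and shows what the worker actually sees (a hypothetical diagnostic sketch, not part of the example; the logged fields are just what I would look at first):

import os

import torch
import ray
from ray import train


def debug_gpu_visibility():
    # What the Kubernetes pod / Ray runtime exposes to this worker process.
    print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("ray.get_gpu_ids():", ray.get_gpu_ids())
    # What PyTorch can see from inside the container.
    print("torch.cuda.is_available():", torch.cuda.is_available())
    print("torch.cuda.device_count():", torch.cuda.device_count())
    # The device Ray Train hands to prepare_data_loader / prepare_model.
    print("train.torch.get_device():", train.torch.get_device())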
Edit:
The same error does not occur with the Docker image rayproject/ray:2.0.0-py39-gpu.