How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I'm running several ML experiments with Python/PyTorch on a shared server with multiple GPUs and CPUs. The trials start normally and proceed for a while until some of them simply stop making progress; for example, in the attached progress plot the pink line has completed only 2 steps.
Looking at the Ray dashboard, it seems that the trials get stuck during a training epoch. When I inspect the tasks of a stuck trial, I usually see a single .train task running for minutes or even hours, even though a single epoch should only take ~30-40 s.
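For reference, the same list of running tasks can also be pulled programmatically (a rough sketch; I'm assuming Ray's state API, `ray.util.state`, which I believe ships with 2.x, and the filter values here are illustrative):

```python
# Sketch: list tasks that are still running, to spot the long-lived .train task.
from ray.util.state import list_tasks

running = list_tasks(filters=[("state", "=", "RUNNING")])
for t in running:
    print(t)  # each stuck trial shows one long-running *.train task here
```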
Digging deeper, I wrapped the batch iterator in tqdm, and it looks like the trial gets stuck at the very beginning of the epoch. For example, here is the stderr output from one stuck task:
```
16%|█▌        | 8/50 [04:54<26:18, 37.59s/it]   # epoch counter
 0%|          | 0/887 [00:00<?, ?it/s]          # batch counter did not start
```
Following this, I tried adjusting num_workers in the PyTorch DataLoader, and setting num_workers=0 solves the issue. However, this comes at a cost: the runs take ages to finish.
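For context, my training function looks roughly like this (a simplified, self-contained sketch with dummy data; the real dataset, model, and hyperparameter names differ):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm
from ray.air import session  # reporting API in Ray 2.6.x

def train_func(config):
    # Dummy data stands in for my real dataset; the loop structure is what matters.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    loader = DataLoader(
        dataset,
        batch_size=config["batch_size"],
        shuffle=True,
        num_workers=config["num_workers"],  # > 0 hangs on the server, 0 is fine but slow
    )
    model = nn.Linear(32, 2)
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    loss_fn = nn.CrossEntropyLoss()

    for epoch in tqdm(range(config["epochs"]), desc="epoch"):
        # With num_workers > 0 on the shared server, the inner tqdm bar
        # stays at 0/N: the first batch never arrives.
        for x, y in tqdm(loader, desc="batch", leave=False):
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        session.report({"loss": float(loss.item()), "epoch": epoch})
```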
I still cannot explain this and I'm not sure what causes it, especially because:
- Running the same type of experiments on my local machine with Ray Tune and num_workers > 0 works fine
- Running single trials without Ray Tune on the server with num_workers > 0 also works fine
Does anybody know what is happening here, and why using multiple workers in the DataLoader causes Ray Tune to get stuck (on some machines)?
Additional info:
- No major Python package differences (that I can see) between the server and my local machine
- Even if I run ~4 trials on a single GPU with num_workers > 0, it works fine (see the resource sketch after this list)
- Package versions:
ray: 2.6.3
python: 3.8.13
pytorch: 1.10.2
CUDA: 11.1.74
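For the single-GPU check mentioned above, the Tuner setup looks roughly like this (a sketch; the search space and resource numbers are illustrative, and train_func is the function sketched earlier):

```python
from ray import tune

# Pack 4 trials onto one GPU by requesting a fractional GPU per trial.
tuner = tune.Tuner(
    tune.with_resources(train_func, {"cpu": 4, "gpu": 0.25}),
    param_space={
        "batch_size": 64,
        "num_workers": 4,  # > 0: fine locally and in this packed setup,
                           # hangs in the usual multi-GPU server runs
        "lr": tune.loguniform(1e-4, 1e-2),
        "epochs": 50,
    },
    tune_config=tune.TuneConfig(num_samples=4),
)
results = tuner.fit()
```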