Ray Tune trials getting stuck in a deadlock

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I’m running several ML experiments with Python/PyTorch on a shared server with multiple GPUs and CPUs. The trials start normally and proceed for a while, until some of them simply stop making progress. For example, in my progress plot one trial (the pink line) has completed only 2 steps.

Looking at the dashboard, it seems the trials get stuck inside a training epoch. When I inspect the tasks of a stuck trial, I usually see a single .train task that has been running for minutes or even hours, even though one epoch should only take ~30–40 s.

Digging deeper, I wrapped the batch iterator in tqdm, and it looks like the trial gets stuck at the very beginning of the epoch. For example, the stderr file from one stuck task:

16%|█▌        | 8/50 [04:54<26:18, 37.59s/it]  # epoch counter
 0%|          | 0/887 [00:00<?, ?it/s]         # batch counter did not start
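For context, this is roughly where the two bars sit in my training function (a simplified sketch, not my actual code; the dataset and loop body are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm

def train_fn(config):
    # Placeholder dataset just for this sketch; the real one is loaded from disk.
    train_dataset = TensorDataset(torch.randn(887 * 32, 10), torch.zeros(887 * 32))
    loader = DataLoader(train_dataset, batch_size=32,
                        num_workers=config["num_workers"])
    for epoch in tqdm(range(50), desc="epoch"):                 # outer bar (the 8/50 above)
        for batch in tqdm(loader, desc="batch", leave=False):   # inner bar (the 0/887 above)
            pass  # forward/backward step goes here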

Following this, I tried adjusting num_workers in the PyTorch DataLoader, and setting num_workers=0 solves the issue. However, this comes at a cost: the runs take ages to finish.
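For reference, this is roughly how I switch between the two settings (a simplified sketch using the Tuner API and the train_fn sketched above; the resource numbers are illustrative, not my exact setup):

from ray import tune

param_space = {
    "num_workers": 0,   # data loading in the trial process itself: no hang, but much slower
    # "num_workers": 4, # DataLoader worker subprocesses: fast, but trials hang on the server
}

# Fractional GPUs let several trials share one device (illustrative numbers).
trainable = tune.with_resources(train_fn, {"cpu": 2, "gpu": 0.25})

tuner = tune.Tuner(trainable, param_space=param_space)
results = tuner.fit()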

I still cannot explain this and I’m not sure what causes it, especially because:

  • Running the same type of experiments on my local machine with Ray Tune and num_workers > 0 works fine.
  • Running single trials without Ray Tune on the server with num_workers > 0 also works fine.

Does anybody know what is happening, and why having multiple workers in the DataLoader leads Ray Tune to get stuck (only on some machines)?

Additional info:

  • No major Python package differences (from what I can see) between the server and my local machine.
  • Even if I run ~4 trials on a single GPU with num_workers > 0, it works fine.
  • Package versions:
ray: 2.6.3
python: 3.8.13
pytorch: 1.10.2 
CUDA: 11.1.74 

Hey, are you still running into this problem? One suggestion is to use the Ray Dashboard to check the stack trace, which might indicate where the processes are hanging.
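If the dashboard isn’t convenient, a rough alternative (plain Python stdlib, not a Ray API) is to have each trial dump its own stack periodically, for example:

import faulthandler
import sys

def train_fn(config):
    # Every 120 s, write the stack of all threads in this trial process to stderr.
    # On a hung trial, the repeated dumps show exactly which call it is blocked in.
    faulthandler.dump_traceback_later(timeout=120, repeat=True, file=sys.stderr)
    ...  # rest of the training function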