How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I'm running several ML experiments with Python/PyTorch on a shared server with multiple GPUs and CPUs. The trials start normally and proceed for a while until some of them simply stop making progress; for example, in the attached progress plot the pink line has completed only 2 steps.
Looking at the Ray dashboard, it seems that the trials get stuck during a training epoch. When I inspect the tasks of a stuck trial, I usually see a single .train task running for minutes or even hours, even though a single epoch should only take ~30-40 s.
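For reference, the same list of running tasks can also be pulled programmatically (a rough sketch; I'm assuming Ray's state API, `ray.util.state`, which I believe ships with 2.x, and the filter values here are illustrative):

```python
# Sketch: list tasks that are still running, to spot the long-lived .train task.
from ray.util.state import list_tasks

running = list_tasks(filters=[("state", "=", "RUNNING")])
for t in running:
    print(t)  # each stuck trial shows one long-running *.train task here
```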
Digging deeper, I wrapped the batch iterator in tqdm, and it looks like the trial gets stuck at the very beginning of the epoch. For example, here is the stderr output from one stuck task:
```
16%|█▌        | 8/50 [04:54<26:18, 37.59s/it]   # epoch counter
 0%|          | 0/887 [00:00<?, ?it/s]          # batch counter did not start
```
Following this, I tried adjusting num_workers in the PyTorch DataLoader, and setting num_workers=0 solves the issue. However, this comes at a cost: the runs take ages to finish.
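For context, my training function looks roughly like this (a simplified, self-contained sketch with dummy data; the real dataset, model, and hyperparameter names differ):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm
from ray.air import session  # reporting API in Ray 2.6.x

def train_func(config):
    # Dummy data stands in for my real dataset; the loop structure is what matters.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    loader = DataLoader(
        dataset,
        batch_size=config["batch_size"],
        shuffle=True,
        num_workers=config["num_workers"],  # > 0 hangs on the server, 0 is fine but slow
    )
    model = nn.Linear(32, 2)
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    loss_fn = nn.CrossEntropyLoss()

    for epoch in tqdm(range(config["epochs"]), desc="epoch"):
        # With num_workers > 0 on the shared server, the inner tqdm bar
        # stays at 0/N: the first batch never arrives.
        for x, y in tqdm(loader, desc="batch", leave=False):
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        session.report({"loss": float(loss.item()), "epoch": epoch})
```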
I still cannot explain this and I'm not sure what causes it, especially because:
- Running the same type of experiments on my local machine with Ray Tune and num_workers > 0 works fine
- Running single trials without Ray Tune on the server with num_workers > 0 also works fine
Does anybody know what is happening here, and why using multiple workers in the DataLoader causes Ray Tune to get stuck (on some machines)?
Additional info:
- No major Python package differences (that I can see) between the server and my local machine
- Even if I run ~4 trials on a single GPU with num_workers > 0, it works fine (see the resource sketch after this list)
- Package versions:
ray: 2.6.3
python: 3.8.13
pytorch: 1.10.2
CUDA: 11.1.74
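For the single-GPU check mentioned above, the Tuner setup looks roughly like this (a sketch; the search space and resource numbers are illustrative, and train_func is the function sketched earlier):

```python
from ray import tune

# Pack 4 trials onto one GPU by requesting a fractional GPU per trial.
tuner = tune.Tuner(
    tune.with_resources(train_func, {"cpu": 4, "gpu": 0.25}),
    param_space={
        "batch_size": 64,
        "num_workers": 4,  # > 0: fine locally and in this packed setup,
                           # hangs in the usual multi-GPU server runs
        "lr": tune.loguniform(1e-4, 1e-2),
        "epochs": 50,
    },
    tune_config=tune.TuneConfig(num_samples=4),
)
results = tuner.fit()
```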