DD-PPO RolloutWorker Hangs

Hi guys,

I am using DD-PPO with an environment that requires a GPU.
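For context, the setup is roughly of this shape (the env name and every value below are placeholders rather than my actual settings, assuming the Ray 1.x “agents” API):

import ray
from ray.rllib.agents.ppo import DDPPOTrainer

ray.init()

# "MyGPUEnv-v0" and all numbers here are placeholders, not the real config.
trainer = DDPPOTrainer(
    env="MyGPUEnv-v0",
    config={
        "framework": "torch",            # DD-PPO is torch-only
        "num_workers": 4,                # the four rollout workers shown below
        "num_gpus_per_worker": 0.25,     # each worker needs GPU access for the env
        "num_envs_per_worker": 1,
        "rollout_fragment_length": 200,  # per-worker sample size
        "sgd_minibatch_size": 50,
        "num_sgd_iter": 10,
    },
)

for _ in range(100):
    trainer.train()  # training silently stops after a few iterations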

For testing purposes, I started training with only four rollout workers. It learns for a few iterations, but then training stops silently. When I check GPU usage after it stops, I see the following:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1     19824      C   ray::RolloutWorker.par_iter_next()          1875MiB |
|    1     19822      C   ray::RolloutWorker                          1875MiB |
|    1     19823      C   ray::RolloutWorker                          1877MiB |
|    1     19821      C   ray::RolloutWorker                          1877MiB |
+-----------------------------------------------------------------------------+

This lasts forever.

It looks to me like the worker with PID 19824 hangs inside par_iter_next() while all the other workers are waiting. The point at which the problem occurs is random: sometimes in the first iteration, sometimes later.

Do you have any thoughts on the possible cause? Any suggestions on how to debug this (e.g., how to check at which step the rollout worker hangs)? Thanks a lot!

Hey @Mark_Zhang, sorry, I’m not sure what this could be. Could you file a GitHub issue and assign it to me, with a self-sufficient reproduction script?

Thanks

Thanks @sven1977, I figured it out while I was working on the self-sufficient example.

Sorry, I left out a detail in the earlier post because I thought it was not related, but it turns out it is.

I was trying to make DD-PPO work with a multi-agent environment (agents can have early dones, and all agents share the same policy).

I removed the assertion at this line, since it does not hold for multi-agent training. I thought it shouldn’t matter, as all the sampled experience is used to train the same policy. But this causes the hang when the batches from different workers (which have different sizes) are divided into different numbers of mini-batches using the same config[“sgd_minibatch_size”]. Please see the illustration below. This also explains why the hang occurs at random times.
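A rough numeric stand-in for that illustration (hypothetical sizes, just to show the mismatch in mini-batch counts):

# Hypothetical per-worker batch sizes; with multi-agent early dones they end up unequal.
sgd_minibatch_size = 500
worker_batch_sizes = [1500, 1500, 1500, 2000]

for i, size in enumerate(worker_batch_sizes):
    num_minibatches = -(-size // sgd_minibatch_size)  # ceiling division
    print(f"worker {i}: batch of {size} -> {num_minibatches} mini-batches")

# Each mini-batch update is followed by a gradient allreduce across all workers.
# Worker 3 issues one more allreduce than workers 0-2, so it waits forever for
# peers that have already finished their SGD loop, and training hangs. Whether
# and when the batch sizes diverge varies from iteration to iteration, which is
# why the hang shows up at random times.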

My workaround is to use an adaptive mini-batch size, i.e., “batch.size/NUM_MINI_BATCH”, so that every worker ends up with the same number of mini-batches, and this worked. Any comments on this? Is there a more graceful way to handle it?
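Roughly what I mean by the adaptive size (a sketch only; NUM_MINI_BATCHES and the helper name are mine, and remainder handling plus where this plugs into the DD-PPO code are glossed over):

# Every worker derives its mini-batch size from its own batch, so all workers
# run the same fixed number of SGD mini-batches per iteration.
NUM_MINI_BATCHES = 4  # any fixed count works, as long as all workers use the same one

def adaptive_minibatch_size(batch_size: int) -> int:
    # batch_size is the per-worker sample batch size (e.g. SampleBatch.count).
    return max(1, batch_size // NUM_MINI_BATCHES)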


This looks great @Mark_Zhang! Could you provide a PR with your fix? I’m assuming it only affects DD-PPO, and if you know it’s learning in your case, we should merge this to help others use DD-PPO in a multi-agent setting.