Hi guys,
I am using DD-PPO with an environment that requires a GPU.
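For reference, my training script is roughly like the sketch below (the env name is a placeholder and most hyperparameters are omitted or simplified; each rollout worker gets a slice of the GPU because the environment itself needs CUDA):

```python
import ray
from ray import tune

ray.init()

tune.run(
    "DDPPO",
    config={
        "env": "MyGpuEnv-v0",         # placeholder for my custom GPU-based env
        "framework": "torch",
        "num_workers": 4,             # only four rollout workers for this test
        "num_gpus": 0,                # DD-PPO does its SGD on the workers, not the driver
        "num_gpus_per_worker": 0.25,  # each worker gets a share of the same GPU
    },
)
```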
For testing purposes, I started training with only four rollout workers. It learns for a few iterations, but then training stops silently. When I check the GPU usage after training has stopped, I see the following:
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1     19824      C   ray::RolloutWorker.par_iter_next()          1875MiB |
|    1     19822      C   ray::RolloutWorker                          1875MiB |
|    1     19823      C   ray::RolloutWorker                          1877MiB |
|    1     19821      C   ray::RolloutWorker                          1877MiB |
+-----------------------------------------------------------------------------+
This lasts forever.
It looks to me like the worker with PID 19824 hangs inside par_iter_next() while all the other workers are waiting. The point at which the problem occurs is random: sometimes in the first iteration, sometimes later.
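The only concrete idea I have for inspecting the hang so far is to register a signal-triggered traceback dump in every rollout worker process, for example by putting something like the snippet below at the top of my custom env's __init__ (the env is created once per worker), so that after the hang I can run `kill -USR1 19824` and see where that worker is stuck in Python:

```python
import faulthandler
import signal

# Placed where it runs once per rollout-worker process (e.g. in my env's
# __init__). After the hang, `kill -USR1 <pid>` makes that worker print its
# current Python stack trace to stderr.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```

I'm not sure this would show anything useful if the worker is actually blocked inside a CUDA call rather than in Python code, though.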
Do you have any comments on the possible cause of this? Any suggestions on how to debug it (e.g., how to check at which step the rollout worker hangs)? Thanks a lot!