- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hi. I am trying to figure out the proper way of using RLlib’s `recreate_failed_workers` and Tune’s `max_failures`.
I have an environment that occasionally raises an exception (this is expected). In that case, the solution is simply to restart the environment, and I am using ResetOnExceptionWrapper for that. However, sometimes even this doesn’t help, because multiple exceptions can occur in `reset()`. These events are rare, but they do happen.
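To make the failure mode concrete, here is a minimal sketch of the pattern. It is not the actual ResetOnExceptionWrapper: `FlakyEnv`, `ResetOnException`, and `max_reset_attempts` are hypothetical names made up for illustration, and the classes are plain Python rather than real Gym/RLlib environments. The point is that `step()` errors can be absorbed, but once `reset()` itself keeps failing, the exception has nowhere to go except up to the worker.

```python
class FlakyEnv:
    """Hypothetical stand-in environment whose step() fails on every Nth call."""

    def __init__(self, fail_every=3):
        self.fail_every = fail_every
        self.calls = 0

    def reset(self):
        return 0  # dummy observation

    def step(self, action):
        self.calls += 1
        if self.calls % self.fail_every == 0:
            raise RuntimeError("simulated environment failure")
        return self.calls, 1.0, False, {}


class ResetOnException:
    """Illustrative wrapper: if step() raises, reset the env and end the episode."""

    def __init__(self, env, max_reset_attempts=3):
        self.env = env
        self.max_reset_attempts = max_reset_attempts

    def reset(self):
        last_error = None
        for _ in range(self.max_reset_attempts):
            try:
                return self.env.reset()
            except Exception as exc:
                last_error = exc
        # reset() failed on every attempt: re-raise so the error reaches the worker.
        raise last_error

    def step(self, action):
        try:
            return self.env.step(action)
        except Exception:
            obs = self.reset()
            # Report the episode as finished, with zero reward for the failed step.
            return obs, 0.0, True, {"reset_on_exception": True}


env = ResetOnException(FlakyEnv(fail_every=3))
env.reset()
results = [env.step(0) for _ in range(3)]  # the third step hits the simulated failure
```

In this sketch the third `step()` call is turned into an episode end instead of an exception. The rare case I am asking about corresponds to the `raise last_error` branch.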
So I investigated further and found this guide on `recreate_failed_workers`. I gave it a try, but it seemed to a) recreate the workers only once all of them had eventually failed, and b) raise `RuntimeError: Failed to recover from worker crash.` after some time.
I found out that if I additionally set Tune’s `max_failures` to -1, then it seems to work. However, I am not convinced this is the proper way. Can you guys please clarify this for me?
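For concreteness, the combination I am describing looks roughly like the sketch below. It is written as plain dicts so it runs without Ray installed. The `"env"`, `"num_workers"`, and `"PPO"` values are placeholders rather than my real setup, and the commented-out `tune.run` call assumes the dict-style config API, which may differ between Ray versions.

```python
# RLlib side: ask RLlib to recreate rollout workers that have crashed.
rllib_config = {
    "env": "CartPole-v1",             # placeholder, not my real environment
    "num_workers": 2,                 # placeholder worker count
    "recreate_failed_workers": True,  # the RLlib setting from the guide
}

# Tune side: -1 lets Tune retry a failed trial an unlimited number of times.
tune_kwargs = {"max_failures": -1}

# With Ray installed, these would be passed along roughly like this
# (assumed call shape, left commented out so the sketch has no dependencies):
#
#   from ray import tune
#   tune.run("PPO", config=rllib_config, **tune_kwargs)
```

The two settings act at different levels: `recreate_failed_workers` is handled inside the RLlib algorithm, while `max_failures` makes Tune restart the whole trial, which is why I am unsure whether relying on the second one is the intended approach.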