Training on an unstable environment

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi. I am trying to figure out the proper way of using RLlib’s recreate_failed_workers and Tune’s max_failures.

I have an environment that, from time to time, might raise an exception (this is expected). In this scenario, the solution is simply to restart the environment. For that, I am using ResetOnExceptionWrapper (sketched below). However, sometimes even this doesn’t help, as multiple exceptions can occur in reset() as well. These are rare events, but they still happen.
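For context, the wrapper does roughly the following (a simplified sketch using the classic gym API; the retry budget is an arbitrary choice for illustration):

```python
import gym


class ResetOnExceptionWrapper(gym.Wrapper):
    """Restart the wrapped env whenever it raises (simplified sketch)."""

    MAX_RESET_RETRIES = 3  # arbitrary retry budget for this sketch

    def reset(self, **kwargs):
        # reset() itself can raise, so retry a few times before giving up.
        last_exc = None
        for _ in range(self.MAX_RESET_RETRIES):
            try:
                return self.env.reset(**kwargs)
            except Exception as exc:
                last_exc = exc
        # Retries exhausted: re-raise so the worker fails and the
        # RLlib/Tune fault-tolerance settings discussed below take over.
        raise last_exc

    def step(self, action):
        try:
            return self.env.step(action)
        except Exception:
            # The env crashed mid-episode: reset and end the episode so
            # the trainer starts a fresh one.
            obs = self.reset()
            return obs, 0.0, True, {"env_crashed": True}
```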

So I investigated further and found this guide on recreate_failed_workers. I gave it a try, but it seemed to a) recreate the workers only after all of them had eventually failed, and b) raise RuntimeError: Failed to recover from worker crash. after some time.

I found out that if I additionally set Tune’s max_failures to -1, it seems to work. However, I am not convinced this is the proper way. Could you please clarify this for me?
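For reference, this is roughly my current setup (a sketch: "my_unstable_env" stands in for my actual registered env, and the choice of PPO is incidental):

```python
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "my_unstable_env",  # placeholder for my actual env
        # RLlib-level fault tolerance: try to rebuild crashed rollout
        # workers instead of failing the whole run.
        "recreate_failed_workers": True,
    },
    # Tune-level fault tolerance: restart the trial on any failure that
    # RLlib itself could not catch; -1 means unlimited retries.
    max_failures=-1,
)
```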

Hi @iamhatesz,

RLlib was written in a way that some errors cannot be caught by RLlib itself (in which case you will have to adapt Tune’s max_failures). Inside RLlib, you can only adapt ignore_worker_failures and recreate_failed_workers. I’m sorry, but the only way to deal with such extremely fragile envs right now is setting max_failures to -1, as you did. However, making changes to this is on our roadmap, and you will very likely see improvements catering to your situation in the coming months. I recommend checking the master branch or release updates in 1-3 months, depending on how important this is to you! :slight_smile:
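To make the two RLlib-level knobs concrete, they behave roughly like this (a sketch, not exhaustive):

```python
config = {
    # Drop a crashed rollout worker and keep training with the
    # remaining workers.
    "ignore_worker_failures": True,
    # Additionally try to build a fresh worker in place of the
    # crashed one.
    "recreate_failed_workers": True,
}
```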