Training on an unstable environment

iamhatesz · August 18, 2022, 7:57am

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi. I am trying to figure out the proper way of using RLlib’s recreate_failed_workers and Tune’s max_failures.

I have an environment that occasionally raises an exception (this is expected). When that happens, the fix is simply to restart the environment, which I do with ResetOnExceptionWrapper. However, sometimes even this doesn’t help, because reset() itself can raise several exceptions in a row. These events are rare, but they do happen.
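For readers unfamiliar with this pattern, a wrapper of this kind might look roughly like the sketch below. This is not the actual ResetOnExceptionWrapper from the post: the class name ResetOnExceptionSketch, the max_reset_attempts parameter, the FlakyEnv stand-in, and the old 4-tuple step() signature are all assumptions made for illustration.

```python
class FlakyEnv:
    """Hypothetical stand-in for an environment whose backend sometimes fails."""

    def __init__(self, reset_failures=0):
        # Number of upcoming reset() calls that will raise.
        self.reset_failures = reset_failures

    def reset(self):
        if self.reset_failures > 0:
            self.reset_failures -= 1
            raise ConnectionError("simulated backend failure in reset()")
        return 0  # dummy observation

    def step(self, action):
        if action == "crash":
            raise ConnectionError("simulated backend failure in step()")
        return 1, 1.0, False, {}


class ResetOnExceptionSketch:
    """Hypothetical wrapper: ends the episode when step() raises and
    retries reset() a bounded number of times."""

    def __init__(self, env, max_reset_attempts=3):
        self.env = env
        self.max_reset_attempts = max_reset_attempts
        self._last_obs = None

    def reset(self):
        last_error = None
        for _ in range(self.max_reset_attempts):
            try:
                self._last_obs = self.env.reset()
                return self._last_obs
            except Exception as err:  # broad on purpose: any env error triggers a retry
                last_error = err
        # All attempts failed. This is the rare case described above, where the
        # error escapes the wrapper and the worker itself crashes.
        raise RuntimeError(
            f"reset() failed {self.max_reset_attempts} times in a row"
        ) from last_error

    def step(self, action):
        try:
            obs, reward, done, info = self.env.step(action)
        except Exception as err:
            # Report the episode as finished so the caller calls reset() next.
            return self._last_obs, 0.0, True, {"env_error": repr(err)}
        self._last_obs = obs
        return obs, reward, done, info
```

With this sketch, a step() failure is turned into a normal episode end, while repeated reset() failures still surface as an exception once the retry budget is exhausted.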

So, I investigated further and found this guide on recreate_failed_workers. I gave it a try, but it seemed to a) recreate the workers only once all of them had eventually failed, and b) raise RuntimeError: Failed to recover from worker crash. after some time.

I found out that if I additionally set Tune’s max_failures to -1, then it seems to work. However, I am not convinced this is the proper way. Can you guys please clarify this for me?

arturn · September 4, 2022, 3:20pm

Hi @iamhatesz ,

RLlib is written in a way that means some errors cannot be caught by RLlib itself, in which case you have to adjust Tune’s max_failures. Inside RLlib, the only settings you can adjust are ignore_worker_failures and recreate_failed_workers. I’m sorry, but the only way to deal with such extremely fragile envs right now is to set max_failures to -1, as you did. However, changing this is on our roadmap, and you will very likely see improvements that cater to your situation in the coming months. I recommend checking the master branch or release updates in 1-3 months, depending on how important this is to you!
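A minimal sketch of the combination discussed in this thread, assuming the dict-style RLlib config and the tune.run entry point from the Ray releases current at the time of the thread. The "PPO" algorithm, the CartPole-v1 env name, the worker count, the stop condition, and the helper names build_rllib_config and run_training are placeholders chosen for illustration, and the config keys may differ in other Ray versions.

```python
# Tune-level setting: -1 means failed trials are retried without limit.
TUNE_MAX_FAILURES = -1


def build_rllib_config(env_name="CartPole-v1", num_workers=2):
    """Return an RLlib config dict (hypothetical helper, placeholder values)."""
    return {
        "env": env_name,
        "num_workers": num_workers,
        # RLlib-level setting: try to recreate a rollout worker after it fails.
        "recreate_failed_workers": True,
    }


def run_training():
    """Launch training through Tune (hypothetical helper)."""
    # Imported here so the config above can be inspected without Ray installed.
    from ray import tune

    return tune.run(
        "PPO",
        config=build_rllib_config(),
        max_failures=TUNE_MAX_FAILURES,
        stop={"training_iteration": 10},
    )
```

The two settings act at different levels: recreate_failed_workers asks RLlib to replace failed workers inside a running trial, while max_failures tells Tune how often to restart the whole trial when an error escapes RLlib.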
