Env worker expiry time

Hi,

I have a simulator that seems to have a memory leak, and I get OOM errors during long training runs.

Is there a mechanism to set a limit on an env worker, e.g. max episodes or max iterations, and then just recreate it from scratch?

Thanks

You could try stopping the agent when it reaches X timesteps via the stop condition in the config: ray/atari-impala-large.yaml at master · ray-project/ray · GitHub
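For reference, here is roughly the Python equivalent of that YAML stop block (a minimal sketch; the 1_000_000 budget and the IMPALA/Atari config values are just placeholders):

```python
import ray
from ray import tune

ray.init()
tune.run(
    "IMPALA",
    # Stop the whole training run once this many env timesteps are sampled.
    stop={"timesteps_total": 1_000_000},  # placeholder budget
    config={
        "env": "BreakoutNoFrameskip-v4",  # placeholder env
        "num_workers": 4,
    },
)
```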

Thanks, but I’m not trying to stop the training with timesteps_total; rather, I’m trying to limit how long an env worker is used during a long training run.

Hi @vakker00,

Not as far as I have been able to find. I had the same issue for a long time; in the end, I just had to track down the memory leak and fix it.

I did try writing a callback that would restart the environment after so many episodes (a sketch of what that looked like is below). That did not work great because I had many workers all using the same environment. I did not have independent instances for each worker; instead, they all interacted with a central application through websockets.
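For what it’s worth, the callback was roughly like this (a hedged sketch from memory; RECREATE_EVERY, the recreate_sim() hook, and the per-env episode counter are all assumptions):

```python
from ray.rllib.agents.callbacks import DefaultCallbacks

RECREATE_EVERY = 100  # assumed episode threshold

class EnvRecreateCallbacks(DefaultCallbacks):
    def on_episode_end(self, *, worker, base_env, policies, episode,
                       env_index, **kwargs):
        # Count episodes per sub-environment on this worker.
        env = base_env.get_unwrapped()[env_index]
        env._episodes_seen = getattr(env, "_episodes_seen", 0) + 1
        if env._episodes_seen % RECREATE_EVERY == 0:
            # Assumes the env exposes a recreate_sim() hook that tears
            # down and rebuilds its simulator. With one central simulator
            # shared over websockets, every worker ends up hitting the
            # same backend, which is why this fell apart for me.
            env.recreate_sim()
```

You would then pass it via config={"callbacks": EnvRecreateCallbacks}.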

Can you share more info about your environment and how it is set up?

Another approach you could try is to set it up as an external env. In that setup the trainer does not control the workers; it is purely a consumer of samples, so you could implement restart logic in your policy client.
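A rough sketch of that idea (untested; make_sim() and RESTART_EVERY are hypothetical, and it assumes a policy server already listening on port 9900, as in the RLlib client/server examples). Since the client loop owns the simulator, it can throw it away and rebuild it between episodes:

```python
import gc

from ray.rllib.env.policy_client import PolicyClient

RESTART_EVERY = 50  # assumed episode threshold

client = PolicyClient("http://localhost:9900", inference_mode="remote")
sim = make_sim()  # hypothetical simulator factory

for i in range(10_000):
    if i > 0 and i % RESTART_EVERY == 0:
        # The client owns the simulator, so it can drop the leaky
        # instance entirely and build a fresh one.
        del sim
        gc.collect()
        sim = make_sim()
    eid = client.start_episode()
    obs = sim.reset()
    done = False
    while not done:
        action = client.get_action(eid, obs)
        obs, reward, done, info = sim.step(action)
        client.log_returns(eid, reward, info)
    client.end_episode(eid, obs)
```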

Thanks for the suggestion. I just ended up explicitly deleting the wrapped simulator after a certain number of episodes, i.e. in my env.reset there’s an if self.episodes % threshold == 0 check that deletes it (and forces GC). That’s a bit dirty, but it seems to do the trick until the memory leak is fixed.
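In case it helps anyone else, the workaround looks roughly like this (a sketch rather than my exact code; env_fn and the threshold default of 100 are placeholders):

```python
import gc
import gym

class ExpiringEnv(gym.Wrapper):
    """Recreates the wrapped simulator every `threshold` episodes."""

    def __init__(self, env_fn, threshold=100):
        super().__init__(env_fn())
        self._env_fn = env_fn
        self._threshold = threshold
        self._episodes = 0

    def reset(self, **kwargs):
        self._episodes += 1
        if self._episodes % self._threshold == 0:
            # Drop the leaky simulator and force a full collection
            # before building a fresh instance.
            self.env.close()
            del self.env
            gc.collect()
            self.env = self._env_fn()
        return self.env.reset(**kwargs)
```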