I am currently training with 100 rollout workers and 4 envs per worker, so 400 environments in total. The dataframe that makes up one episode for a single environment is 50,000 rows long, so when I train with 400 environments, my episode becomes 22 million timesteps long.
I have found that performance is worse with episodes this long than with shorter ones. For example, an episode length of 800k timesteps (10 workers, each with 1 env) converges in fewer timesteps but takes longer in wall-clock time.
Any ideas for how to fix this? One idea I had is to split the rollout workers into groups of 10 and have each group run through its own episodes while still updating the central policy. I'm not sure whether this is possible or how hard it would be.
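To make the grouping idea concrete, here is a rough sketch of the bookkeeping I have in mind. It is plain Python, not RLlib code: `GROUP_SIZE`, `group_for_worker`, and `build_groups` are hypothetical names, and I am assuming rollout workers are numbered from 1 (which I believe matches RLlib's `worker_index`, where 0 is the local worker).

```python
from collections import defaultdict

GROUP_SIZE = 10  # hypothetical: number of rollout workers per group


def group_for_worker(worker_index: int, group_size: int = GROUP_SIZE) -> int:
    """Map a 1-based rollout worker index to a 0-based group id."""
    if worker_index < 1:
        raise ValueError("expected a 1-based rollout worker index")
    return (worker_index - 1) // group_size


def build_groups(num_workers: int, group_size: int = GROUP_SIZE) -> dict:
    """Return {group_id: [worker indices]} for all rollout workers."""
    groups = defaultdict(list)
    for worker_index in range(1, num_workers + 1):
        groups[group_for_worker(worker_index, group_size)].append(worker_index)
    return dict(groups)


# 100 rollout workers split into groups of 10
groups = build_groups(num_workers=100)
print(len(groups))
print(groups[0])
```

Each env could then look up its group id from its worker index and treat that group as its own episode stream. This only shows how workers would be assigned to groups; it does not show how (or whether) the central policy can be updated from several such groups at once, which is the part I am unsure about.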
Here is the very long episode: