Memory efficiency in extremely long-horizon environments

I have an environment with a horizon of well over 1000 timesteps and up to 1024 concurrent agents. Running PPO, a single copy of the environment easily exhausts 64 GB of RAM. OpenAI Five (pages 31-32) uses a value-function-bootstrapping-inspired approach (they also predict win probability) and splits trajectories into smaller segments. Would something similar be possible in RLlib, or do you have other ideas for supporting very long time horizons?
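For reference, the segment-splitting idea described above can be sketched roughly like this: instead of holding a full >1000-step trajectory in memory, each fixed-size segment is processed on its own, with the value estimate at the first state *after* the segment standing in for all truncated future rewards. This is an illustrative sketch, not RLlib or OpenAI Five code; the names `segment_returns`, `rewards`, and `bootstrap_value` are made up here.

```python
def segment_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted returns for one truncated trajectory segment.

    `bootstrap_value` is V(s) at the first state after the segment,
    replacing all rewards beyond the truncation point.
    """
    returns = []
    running = bootstrap_value
    for r in reversed(rewards):
        # Standard backward recursion: G_t = r_t + gamma * G_{t+1},
        # seeded with the bootstrap value at the segment edge.
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# Example: a 4-step segment cut out of a much longer episode.
rets = segment_returns([1.0, 0.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.5)
```

Memory then scales with the segment length rather than the full horizon, at the cost of some bias from the bootstrapped value estimate.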


Actually, the new trajectory view API might help you here. It's enabled by default for most major algos ((A|DD)PPO, SAC, DQN, A2/3C, IMPALA, DDPG, TD3, PG) in both torch and tf, and it saves some memory during sample collection (e.g. next_obs is not passed on to the trainer process).
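A minimal config sketch, assuming the RLlib ~1.0-era config keys (on recent versions the trajectory view flag is already on by default, so you usually don't need to set it; check your version's defaults):

```python
# Hypothetical PPO config fragment; keys other than "framework" and
# "num_workers" should be checked against your installed RLlib version.
config = {
    "framework": "torch",
    "_use_trajectory_view_api": True,   # memory-saving sample collection
    "num_workers": 4,
    # Shipping shorter rollout fragments also bounds per-worker memory.
    "rollout_fragment_length": 200,
}
```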
For LSTMs - with the trajectory view API - only the needed internal-state vectors (at the max_seq_len chunk edges) are transferred and stored. There is also no more double storage of state_in/state_out: they hold the same data, just shifted by one timestep, so we save ~50% of the state memory during sample collection.
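A toy illustration (plain Python, not RLlib code) of both points: state_in at step t equals state_out at step t-1, so one array plus the initial state reconstructs both views, and only the states at the max_seq_len chunk edges need to be kept to re-run each chunk.

```python
# Per-timestep LSTM states emitted over a 5-step rollout (toy values).
state_out = [f"h{t}" for t in range(5)]

# state_in is the same data shifted by one timestep, so storing it
# separately would double the memory for no new information.
state_in = ["h_init"] + state_out[:-1]

# With max_seq_len chunking, only the state at the start of each chunk
# is needed to replay that chunk through the LSTM at train time.
max_seq_len = 2
kept = state_in[::max_seq_len]
```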