I am trying to train two separate agents in two different environments within the same training loop. The output from one env.step call has to be fed into the second environment before I call compute_single_action for the second agent. Currently, in RLlib, everything seems to be encapsulated behind a .train() method, with very little opportunity for customization during training.
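To make the coupling concrete, here is roughly what one step of the joint loop looks like in my head. The two CartPole environments and the commented-out `env_b.set_context(...)` call are only placeholders for my actual environments, and I am assuming a Ray/RLlib version in which `PPOConfig().build()` and `Algorithm.compute_single_action()` are still available (the latter is what I am calling today):

```python
import gymnasium as gym
from ray.rllib.algorithms.ppo import PPOConfig

# Two placeholder environments; in my real setup, env_b depends on env_a's output.
env_a = gym.make("CartPole-v1")
env_b = gym.make("CartPole-v1")

# One PPO algorithm per agent.
algo_a = PPOConfig().environment("CartPole-v1").build()
algo_b = PPOConfig().environment("CartPole-v1").build()

obs_a, _ = env_a.reset()
obs_b, _ = env_b.reset()

for _ in range(10):
    # Agent A acts first.
    action_a = algo_a.compute_single_action(obs_a, explore=True)
    obs_a, reward_a, term_a, trunc_a, _ = env_a.step(action_a)

    # The output of env_a.step must reach env_b *before* agent B acts;
    # set_context is a placeholder for however that coupling is implemented.
    # env_b.set_context(obs_a, reward_a)

    action_b = algo_b.compute_single_action(obs_b, explore=True)
    obs_b, reward_b, term_b, trunc_b, _ = env_b.step(action_b)

    if term_a or trunc_a:
        obs_a, _ = env_a.reset()
    if term_b or trunc_b:
        obs_b, _ = env_b.reset()
```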
The crux of the problem is that I cannot find a good example for the current RLlib version that shows how to perform the following steps explicitly (a skeleton of what I mean follows the list):
- Set up an environment
- Set up the RLlib PPO agent
- In a for loop over a fixed step budget:
  - compute an action from the current observation of the environment,
  - collect the resulting transitions in some form of RLlib buffer class, if one exists,
  - every N steps, perform the PPO update from the collected buffer.
- Evaluate the agent at regular intervals.
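Put differently, I am after a skeleton like the one below (single agent/env for brevity). The buffer is just a Python list of SampleBatch-style dicts because I could not find a documented buffer class intended for this; `TOTAL_STEPS`, `UPDATE_EVERY` and `EVAL_EVERY` are arbitrary numbers of mine; and the `???` comment marks the update step I cannot find a public API for (learn_on_batch? overriding training_step?). I assume `algo.evaluate()` also needs evaluation workers configured in the PPOConfig.

```python
import gymnasium as gym
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.policy.sample_batch import SampleBatch

TOTAL_STEPS = 10_000
UPDATE_EVERY = 512   # train-batch size I would like to control myself
EVAL_EVERY = 2_048

env = gym.make("CartPole-v1")                           # 1. set up an environment
algo = PPOConfig().environment("CartPole-v1").build()   # 2. set up the PPO agent

obs, _ = env.reset()
buffer = []  # stand-in for whatever RLlib buffer class is appropriate

for step in range(TOTAL_STEPS):                         # 3. explicit step loop
    action = algo.compute_single_action(obs, explore=True)
    next_obs, reward, terminated, truncated, _ = env.step(action)

    # Collect one transition, using RLlib's SampleBatch column names.
    buffer.append(
        {SampleBatch.OBS: obs, SampleBatch.ACTIONS: action,
         SampleBatch.REWARDS: reward, SampleBatch.NEXT_OBS: next_obs,
         SampleBatch.TERMINATEDS: terminated}
    )
    obs = next_obs
    if terminated or truncated:
        obs, _ = env.reset()

    if (step + 1) % UPDATE_EVERY == 0:
        # ??? perform one PPO update from `buffer` -- this is the part
        # I cannot find a public, documented API for.
        buffer.clear()

    if (step + 1) % EVAL_EVERY == 0:                    # 4. periodic evaluation
        # algo.evaluate()  (presumably needs evaluation workers in the config)
        pass
```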