I am working with an environment that gets a new set of timeseries data once an episode is created (through callbacks.on_episode_created).
For evaluation, I have one worker with one environment that gets a set of test data in a custom evaluation function. Each time I return terminated=True from my env.step() method, the episode is terminated correctly and a SampleBatch is returned.
However, the env is then reset and step() is called once more, creating a new state-action pair. This observation ends up in the new episode. I do not want this, because I expect the new episode to contain only samples generated with the next policy, after a gradient update.
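To make the behavior concrete, here is a minimal sketch that reproduces it without RLlib. Everything in it is an assumption for illustration: `TimeseriesEnv`, `set_data`, and `sample_one_episode` are hypothetical names, the env only mimics the Gymnasium reset/step signatures, and the loop imitates what I observe rather than RLlib's actual sampler code.

```python
class TimeseriesEnv:
    """Toy stand-in for the real env (hypothetical, not RLlib or Gymnasium code)."""

    def __init__(self):
        self.data = []        # timeseries used for the current episode
        self.t = 0            # index into self.data
        self.step_calls = 0   # counts every step() call, for debugging

    def set_data(self, data):
        # Hypothetical hook: in the real setup the data is swapped in
        # from callbacks.on_episode_created.
        self.data = list(data)

    def reset(self, *, seed=None, options=None):
        self.t = 0
        return self.data[0], {}

    def step(self, action):
        self.step_calls += 1
        self.t += 1
        # The episode ends when the last data point has been reached.
        terminated = self.t >= len(self.data) - 1
        return self.data[self.t], 0.0, terminated, False, {}


def sample_one_episode(env, extra_steps=1):
    """Imitates the observed behavior: one full episode, then a reset
    followed by `extra_steps` additional step() calls."""
    env.reset()
    steps_in_episode = 0
    terminated = False
    while not terminated:
        _, _, terminated, _, _ = env.step(0)
        steps_in_episode += 1
    # Observed: after the batch is returned, the env is reset and stepped again.
    env.reset()
    for _ in range(extra_steps):
        env.step(0)
    return steps_in_episode


env = TimeseriesEnv()
env.set_data([1.0, 2.0, 3.0, 4.0])
steps = sample_one_episode(env)
print(steps, env.step_calls)
```

With four data points, the episode itself takes three steps, but `step_calls` ends at four. That fourth call is the extra step I would like to avoid.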
- Why is a new step executed immediately after a SampleBatch is returned?
- In my specific case, where every episode should be sampled with a different piece of (timeseries) data, how do I make sure that my custom evaluation function starts with a fresh, empty episode that contains no step left over from the previous evaluation? In other words, can I prevent the step after ‘terminated’?
Thanks in advance!