Extra step after environment is terminated


I am working with an environment that receives a new set of timeseries data each time an episode is created (through callbacks.on_episode_created).

For evaluation, I have one worker with one environment that gets a set of test data in a custom evaluation function. Each time I return terminated=True from my env.step() method, the episode is terminated correctly and a SampleBatch is returned.

However, the env is then reset and step() is called once more, creating a new state-action pair. This observation ends up in the next Episode, which I do not want: I expect the new episode to contain only samples generated with the next policy, i.e. after a gradient update.


  1. Why is a new step executed immediately after a SampleBatch is returned?
  2. In my specific case, where every episode should be sampled with a different piece of (timeseries) data, how do I make sure that my custom evaluation function starts with a fresh, empty episode, without a leftover step from the previous evaluation? In other words, can I prevent the step after the ‘terminated’?
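One defensive pattern I am considering (a sketch only; FreshDataGuard, set_data and the "stale" info key are my own names, not RLlib API) is a thin env wrapper that flags any transition taken before fresh data has been installed, so a stray post-terminal step can be identified and filtered out later:

```python
class FreshDataGuard:
    """Toy env wrapper: flags transitions taken before new data was installed.

    This is a sketch; set_data() and the wrapped env's interface are
    assumptions about my own env, not RLlib API.
    """

    def __init__(self, env):
        self.env = env
        self._has_fresh_data = False

    def set_data(self, data, evaluate=False):
        self.env.set_data(data, evaluate=evaluate)
        self._has_fresh_data = True

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Mark samples produced before fresh data arrived, so they can be
        # dropped from the SampleBatch afterwards.
        info["stale"] = not self._has_fresh_data
        if terminated or truncated:
            # Require new data before the next episode counts as "fresh".
            self._has_fresh_data = False
        return obs, reward, terminated, truncated, info
```

The stray step after termination then arrives with info["stale"] == True and can be masked out.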

Thanks in advance!


Maybe some more information is in order.
This is part of the custom evaluation function:

def sample(w):
    # Worker 1 samples the held-out test set; other workers sample the eval set.
    key, data = ('test', test_set) if w.worker_index == 1 else ('eval', eval_set)
    # Push the chosen dataset into every environment on this worker.
    w.foreach_env(lambda e: e.set_data(data, evaluate=True))

    # Collect one sample batch from this worker.
    sample_batch = w.sample()['default_policy']
    return key, (sample_batch['infos'][-1], sample_batch['state_in_0'])

batch = dict(workers.foreach_worker(func=sample, local_worker=False))

The first time this function is run, my sequence length is 740. The second time, it is 741, because there is already one step in the episode. This is undesirable because I work with RNN sequences, so the states must make sense sequentially.

First batch:

SampleBatch(740 (seqs=1): ['obs', 'new_obs', 'rewards', 'terminateds', 'truncateds', 'infos', 'eps_id', 'unroll_id', 'agent_index', 't', 'state_in_0', 'vf_preds', 'values_bootstrapped', 'advantages', 'value_targets'])

Second batch:

SampleBatch(741 (seqs=1): ['obs', 'new_obs', 'rewards', 'terminateds', 'truncateds', 'infos', 'eps_id', 'unroll_id', 'agent_index', 't', 'state_in_0', 'vf_preds', 'values_bootstrapped', 'advantages', 'value_targets'])

This does not make sense because the data is still of length 740.
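Pending a better answer, one workaround I can think of is to drop the leading rows that belong to a previous episode, keyed on the eps_id column. This is a sketch over a plain dict of lists that mimics the SampleBatch columns above, not RLlib's actual SampleBatch class:

```python
def keep_last_episode(batch):
    """Keep only the rows belonging to the most recent episode.

    `batch` is assumed to be a dict of equal-length lists with an
    'eps_id' column, mimicking a SampleBatch (a sketch, not RLlib API).
    """
    last_eps = batch["eps_id"][-1]
    mask = [e == last_eps for e in batch["eps_id"]]
    return {
        key: [v for v, keep in zip(col, mask) if keep]
        for key, col in batch.items()
    }
```

Applied to the 741-row batch above, this would strip the single row carried over from the previous episode, restoring length 740.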

  1. How do you recommend working around this?

  2. Another smaller question/observation: in a SampleBatch, when the last value of terminateds is True, I expect the last value of infos to be the info dict returned by the environment on the terminating step. This is not the case for my environments. Is this expected?
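To pin down question 2, a small consistency check over the batch columns can help. Again a sketch over a dict of lists; the column names follow the SampleBatch printout above, and the 'episode_done' info key is a hypothetical marker the env would write into its terminal info dict:

```python
def terminal_info_aligned(batch, key="episode_done"):
    """Check that rows where terminateds is True carry the terminal info.

    Assumes the env writes a marker (here the hypothetical key
    'episode_done') into the info dict of its terminating step.
    """
    for t, done in enumerate(batch["terminateds"]):
        if done and not batch["infos"][t].get(key, False):
            return False  # terminal row lacks the terminal info -> misaligned
    return True
```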


@sven1977 @ericl

I also notice this behaviour. After one episode, the workers already take one step of the next episode, and not only during evaluation but also during training. This seems wrong, because it means the first action of the new episode is taken with the old policy → then loss.backward() is performed → and then the remaining experience is sampled with the new policy, but conditioned on that first action from the old policy.

Or do I understand this incorrectly?