Episode_reward_mean same across different episodes in continuous environment

Hi all,

I am running SAC on LunarLanderContinuous-v2 and I have noticed the following weird behaviour: the mean reward is moving in steps, i.e. it will stay the same for a couple of episodes, then change a little, then stay the same again for a couple of episodes and so on and so forth.

This does not make sense to me, since I find it highly unlikely that multiple consecutive episodes in LunarLanderContinuous-v2 would score exactly the same (let alone score exactly the current mean, which is what it would take for the mean to stay unchanged!).

PS. I am aware of this similar question; however, that one referred to a gridworld environment, where the mean reward staying the same for the first runs did make sense, as the answer there pointed out. My case is different: this is a continuous environment that should exhibit significant reward variation, and I am observing this pattern throughout the training period (not just in the first iterations).

Relevant code:

from ray.rllib.agents import sac

# config is defined earlier (not shown in this excerpt)
trainer = sac.SACTrainer(config=config, env="LunarLanderContinuous-v2")

for i in range(1000):
    # Perform one iteration of training the policy with SAC
    result = trainer.train()

    print('iteration: {}'.format(i))
    print("episode_reward_mean: {}".format(result["episode_reward_mean"]))
    print("episode_reward_max: {}".format(result["episode_reward_max"]))
    print("episode_reward_min: {}".format(result["episode_reward_min"]))
    print("time_this_iter_s: {}".format(result["time_this_iter_s"]))

Excerpt from the logs:

episode_reward_mean: 226.6910608187078
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 10.656638860702515
iteration: 881
episode_reward_mean: 226.6910608187078
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 10.788938999176025
iteration: 882
episode_reward_mean: 228.5464098614901
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 10.638497591018677
iteration: 883
episode_reward_mean: 228.5464098614901
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 10.65201711654663
iteration: 884
episode_reward_mean: 228.3104440204209
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 10.721709728240967
iteration: 885
episode_reward_mean: 228.3104440204209
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 10.704133987426758
iteration: 886
episode_reward_mean: 228.3104440204209
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 11.02923846244812
iteration: 887
episode_reward_mean: 228.3104440204209
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 11.499016523361206
iteration: 888
episode_reward_mean: 228.3104440204209
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 11.674678325653076
iteration: 889
episode_reward_mean: 225.00186143748047
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 10.89946985244751

@drlatvia,

If you print result["hist_stats"]["episode_reward"], are they the same or different within/between iterations?
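
Something along these lines in your loop should show it (episodes_this_iter is another key in the result dict, reporting how many episodes finished during that iteration; values below are from your own run, not mine):

result = trainer.train()
# list of returns for recently completed episodes (a rolling window)
print(result["hist_stats"]["episode_reward"])
# number of episodes that actually finished during this iteration
print("episodes_this_iter: {}".format(result["episodes_this_iter"]))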

@mannyv,

Thanks a lot for your reply - I didn’t know about “hist_stats”!

So I printed this out, and it seems that for some iterations no new episode reward is appended to this list, as if the episode never happened. This explains why the mean stayed the same across a number of iterations.

Now, I wonder why these episodes do not produce rewards… Could it be because they reach the maximum step count before the episode terminates? I would think that in that case the episode would still have a score (sum of all the rewards up to that point) but maybe RLlib treats this differently?

@drlatvia,

The episode stats are not updated until an episode returns done=True.

You have two options here if you want to try and change what it is doing now (see the config sketch after the list).

  1. If you set the configuration key ‘horizon’ to a positive non-zero value, RLlib will artificially terminate an episode after that many steps.

  2. Assuming your episodes do actually end, if you have an idea of how many steps an episode takes, you can increase timesteps_per_iteration to increase the length of an iteration. The default for SAC is 100.
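
For example, a minimal sketch of both options on top of the default SAC config (the 1000s are just placeholder values, not recommendations):

from ray.rllib.agents import sac

config = sac.DEFAULT_CONFIG.copy()
config["horizon"] = 1000                  # option 1: force-terminate episodes after 1000 steps
config["timesteps_per_iteration"] = 1000  # option 2: collect at least 1000 env steps per iteration
trainer = sac.SACTrainer(config=config, env="LunarLanderContinuous-v2")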


Thanks a lot @mannyv,

This makes perfect sense. Could you please elaborate on the difference between "horizon" and "timesteps_per_iteration"? It sounds like they have a very similar effect…

Best Regards,

@drlatvia

horizon will terminate an episode in RLlib. Even if the episode is not actually done, it will be recorded as done; RLlib will terminate it and start a new one. This has implications for things like training the value function and computing the advantage.

timesteps_per_iteration indicates the minimum number of new environment steps (across all your parallel environments if you have them) that must be collected before an iteration can finish.

An environment need not be done at the end of an iteration. If an environment was on step 20 at the end of the first iteration, then in the second iteration it would continue where it left off, at step 21.



@mannyv, this is truly enlightening.

I was under the impression that iterations and episodes were roughly equivalent. My understanding now is that an iteration will finish once timesteps_per_iteration steps have passed, and the rest of any unfinished episode will be played out in the next iteration.

Thanks a lot for clarifying this.

All the best,

@drlatvia

Correct. There is one more configuration key that affects iteration length: min_iter_time_s, which says that an iteration must last at least that number of seconds. The combination of those two keys determines the length of an iteration, and like the other one it will continue episodes between iterations.
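
A rough sketch of the two keys together (again, placeholder values):

from ray.rllib.agents import sac

config = sac.DEFAULT_CONFIG.copy()
config["timesteps_per_iteration"] = 1000  # an iteration collects at least 1000 new env steps...
config["min_iter_time_s"] = 10            # ...and lasts at least 10 seconds of wall-clock time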