I am running SAC on LunarLanderContinuous-v2 and I have noticed the following weird behaviour: the mean reward is moving in steps, i.e. it will stay the same for a couple of episodes, then change a little, then stay the same again for a couple of episodes and so on and so forth.
This does not make sense to me, since I find it highly unlikely that multiple consecutive episodes in LunarLanderContinuous-v2 would produce the exact same score (let alone a score that happens to equal the current mean!).
PS. I am aware of this similar question; however, that one referred to a gridworld environment where the mean reward being the same for the first runs indeed made sense, as remarked in the answer. My case is significantly different: this is a continuous environment that should exhibit significant reward variation, and I am observing this pattern throughout training (not just in the first iterations).
Relevant code:
import ray
from ray.rllib.agents import sac

ray.init()

trainer = sac.SACTrainer(config=config, env="LunarLanderContinuous-v2")
for i in range(1000):
    # Perform one iteration of training the policy with SAC
    result = trainer.train()
    print("iteration: {}".format(i))
    print("episode_reward_mean: {}".format(result["episode_reward_mean"]))
    print("episode_reward_max: {}".format(result["episode_reward_max"]))
    print("episode_reward_min: {}".format(result["episode_reward_min"]))
    print("time_this_iter_s: {}".format(result["time_this_iter_s"]))
Thanks a lot for your reply - I didn't know about "hist_stats"!
So I printed this out and it seems that for some iterations, no new episode reward is appended to this list - as if the episode never happened. This explains why the mean stayed the same across a number of iterations.
Now, I wonder why these episodes do not produce rewards. Could it be because they reach the maximum step count before the episode terminates? I would think that in that case the episode would still have a score (the sum of all the rewards up to that point), but maybe RLlib treats this differently?
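For reference, this is roughly how I am printing it now (I am assuming the per-episode returns live under the hist_stats / episodes_this_iter keys of the result dict; key names may differ between RLlib versions):

result = trainer.train()
# List of returns for episodes completed recently (assumed key names).
print("episode returns:", result["hist_stats"]["episode_reward"])
# Number of episodes that actually finished during this iteration;
# when this is 0, the reported mean cannot change.
print("episodes this iter:", result["episodes_this_iter"])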
The episode stats are not updated until an episode returns done=True.
You have two options here if you want to try and change this behaviour.
If you set the configuration key "horizon" to a positive, non-zero value, then RLlib will artificially terminate an episode after that many steps.
Assuming your episodes do actually end, if you have an idea of how many steps an episode takes, you can increase timesteps_per_iteration to increase the length of an iteration. The default for SAC is 100. A sketch of both options is below.
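A minimal sketch of both options (the values are only illustrative; pick them based on your environment):

config = {
    # Option 1: force-terminate (and record as done) any episode after this many steps.
    "horizon": 300,
    # Option 2: require more new env steps per iteration so that most
    # iterations contain at least one completed episode (SAC default is 100).
    "timesteps_per_iteration": 1000,
}
trainer = sac.SACTrainer(config=config, env="LunarLanderContinuous-v2")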
This makes perfect sense. Could you please elaborate on the difference between "horizon" and "timesteps_per_iteration"? It sounds like they have a very similar effect…
horizon will terminate an episode in RLlib. Even if the episode is not actually done, it will be recorded as done; RLlib will terminate it and start a new one. This has implications for things like training the value function and computing the advantage.
timesteps_per_iteration indicates the minimum number of new environment steps (across all your parallel environments if you have them) that must be collected before an iteration can finish.
An environment need not be done at the end of an iteration. If you had an environment that was on step 20 at the end of the first iteration, then in the second iteration it would continue where it left off, at step 21.
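To make that concrete with your setup (assuming LunarLanderContinuous-v2 episodes run for a few hundred steps and the SAC default of 100 timesteps per iteration):

# Back-of-the-envelope: with ~100 new env steps per iteration and an
# episode lasting, say, ~400 steps, one episode spans ~4 iterations,
# so roughly 3 out of 4 iterations finish no episode and the reported
# episode_reward_mean stays flat.
steps_per_iteration = 100      # SAC default mentioned above
typical_episode_length = 400   # illustrative guess, not measured
print(typical_episode_length / steps_per_iteration)  # -> 4.0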
I was under the impression that iterations and episodes were roughly equivalent. My understanding now is that an iteration will finish once timesteps_per_iteration steps have passed, and the rest of the episode will be played out in the next iteration.
Correct. There is one more configuration key that affects iteration length: min_iter_time_s, which says that an iteration must last at least that many seconds. The combination of those two keys determines the length of an iteration, and, like the other one, it will continue episodes between iterations.
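For example (illustrative values again), an iteration would then end only after both conditions are met:

config = {
    "timesteps_per_iteration": 1000,  # at least 1000 new env steps per iteration
    "min_iter_time_s": 10,            # and at least 10 seconds of wall-clock time
}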