Hi all,
I am running SAC on LunarLanderContinuous-v2
and I have noticed the following weird behaviour: the episode reward mean moves in steps, i.e. it stays the same for several consecutive training iterations, then changes slightly, then stays the same again for a few more iterations, and so on.
This does not make sense to me, since I find it highly unlikely that multiple consecutive episodes in LunarLanderContinuous-v2
end with the exact same score (let alone that each new score exactly matches the current mean, which is what it would take for the mean to stay constant!)
PS. I am aware of this similar question; however, that one referred to a gridworld environment, where the mean reward staying the same for the first few runs did make sense, as the answer there pointed out. My case is significantly different: this is a continuous environment, which should exhibit significant reward variation, and I am observing this pattern throughout the whole training run (not just in the first iterations).
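As a sanity check on that last point: even a random policy produces noticeably different returns from episode to episode. A minimal sketch (assuming the classic Gym API, where env.step returns four values):

import gym

env = gym.make("LunarLanderContinuous-v2")
returns = []
for _ in range(10):
    obs = env.reset()
    done, ep_return = False, 0.0
    while not done:
        # Random actions: episode returns should still differ substantially
        obs, reward, done, info = env.step(env.action_space.sample())
        ep_return += reward
    returns.append(ep_return)
print(returns)  # ten clearly different episode returns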
Relevant code:
import ray
from ray.rllib.agents import sac

ray.init()
# config is defined earlier (omitted here)
trainer = sac.SACTrainer(config=config, env="LunarLanderContinuous-v2")
for i in range(1000):
    # Perform one iteration of training the policy with SAC
    result = trainer.train()
    print("iteration: {}".format(i))
    print("episode_reward_mean: {}".format(result["episode_reward_mean"]))
    print("episode_reward_max: {}".format(result["episode_reward_max"]))
    print("episode_reward_min: {}".format(result["episode_reward_min"]))
    print("time_this_iter_s: {}".format(result["time_this_iter_s"]))
Excerpt from the logs:
episode_reward_mean: 226.6910608187078
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 10.656638860702515
iteration: 881
episode_reward_mean: 226.6910608187078
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 10.788938999176025
iteration: 882
episode_reward_mean: 228.5464098614901
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 10.638497591018677
iteration: 883
episode_reward_mean: 228.5464098614901
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 10.65201711654663
iteration: 884
episode_reward_mean: 228.3104440204209
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 10.721709728240967
iteration: 885
episode_reward_mean: 228.3104440204209
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 10.704133987426758
iteration: 886
episode_reward_mean: 228.3104440204209
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 11.02923846244812
iteration: 887
episode_reward_mean: 228.3104440204209
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 11.499016523361206
iteration: 888
episode_reward_mean: 228.3104440204209
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 11.674678325653076
iteration: 889
episode_reward_mean: 225.00186143748047
episode_reward_max: 306.0011453243046
episode_reward_min: -73.61330238125129
time_this_iter_s: 10.89946985244751