My env returns -0.1 most of the time, and sometimes it returns +1 as the reward. But for a while, about 10 iterations, the episode reward mean stayed at the same number. Why? I wouldn't expect the episode reward mean to stay unchanged for 10 iterations, since the reward is -0.1 most of the time.
Maybe your agent hasn't learnt to reach the goal (+1) yet? This is typical behavior for grid worlds with a per-step reward of -0.1 and a positive goal reward that terminates the episode.
Sounds completely normal.
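To see why the mean can sit at one value, here's a small sketch (the 100-step horizon and reward values are assumptions for illustration, not your actual config): if the agent never reaches the goal and episodes end at a fixed step limit, every episode's return is identical, so the mean can't move.

```python
# Hypothetical grid-world reward setup: -0.1 per step, +1 on reaching
# the goal, episodes truncated at a fixed horizon of 100 steps.
HORIZON = 100
STEP_REWARD = -0.1
GOAL_REWARD = 1.0

def episode_return(reached_goal: bool, steps: int = HORIZON) -> float:
    """Total reward collected in one episode."""
    if reached_goal:
        # -0.1 on each step before the final one, +1 on the goal step.
        return STEP_REWARD * (steps - 1) + GOAL_REWARD
    # Timed out without reaching the goal: -0.1 on every step.
    return STEP_REWARD * steps

# Agent that never reaches the goal: every episode returns the same value,
# so the mean over any batch of such episodes is constant across iterations.
returns = [episode_return(False) for _ in range(5)]
print(returns)  # five identical values of -10.0
```

Once the agent starts reaching the goal occasionally (shorter episodes, +1 bonus), the returns differ and the mean begins to move.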
Btw, the value reported under `episode_reward_mean` is the average episode return over the episodes in the train batch used in that iteration.