Hi @LeoLeoLeo,
You were already looking at the correct value to track reward during training: `episode_reward_mean`. Each time it is logged, it is the mean over the most recent 100 completed episodes; those are the ones stored in `hist_stats`. If no episodes complete during the sample phase, the value will stay the same, since `hist_stats` will not change either.
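For example, here is a minimal sketch of reading those values from the results dict returned by `train()` (assuming the legacy results layout that exposes `hist_stats`; CartPole is just a placeholder environment):

```python
from ray.rllib.algorithms.ppo import PPOConfig

algo = PPOConfig().environment("CartPole-v1").build()

result = algo.train()
# Mean reward over the smoothing window of most recent completed episodes.
print(result["episode_reward_mean"])
# The per-episode returns that the mean is computed from.
print(result["hist_stats"]["episode_reward"])
```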
You can control the number of episodes kept in `hist_stats` with the reporting argument `metrics_num_episodes_for_smoothing`.
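For example, a sketch of setting it through the config API (the window size of 20 is just an illustration):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    # Smooth episode_reward_mean over the last 20 completed episodes
    # instead of the default 100.
    .reporting(metrics_num_episodes_for_smoothing=20)
)
algo = config.build()
```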
What are you trying to accomplish with the truncation? Unless you have a specific need, for example your environment never terminates, or you want to enforce a maximum length so that an episode times out if an agent gets stuck, there is no need to truncate an episode. In many cases, variable-length episodes and episodes of unknown length work just fine in RLlib and PPO without any special treatment.