I would like to discuss an interesting observation I recently made regarding the episode_reward_mean that is logged for each training iteration.
The episode_reward_mean visible in the progress.csv file is always calculated over ALL episodes that have been executed in the trial. The underlying data can be found in hist_stats/episode_reward.
However, what do you think about an episode_reward_mean_per_iteration? That metric would compute the mean over ONLY the NEW episodes that occurred in the current iteration (a sketch of what I mean follows the list below).
I see the following benefits:
A measure of convergence to a (local) optimum of the reward function
Better judgement of solution quality, in the sense of “will any more iterations make sense?”
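As a rough illustration of what I have in mind, here is a minimal sketch of how such a metric could be derived from the standard result dict, assuming a recent Ray 2.x layout where result["episodes_this_iter"] counts newly finished episodes and result["hist_stats"]["episode_reward"] holds recent episode returns with the newest entries last. The helper name episode_reward_mean_per_iteration and the callback class are my own invention, not an existing RLlib API:

```python
import numpy as np
from ray.rllib.algorithms.callbacks import DefaultCallbacks


def episode_reward_mean_per_iteration(result: dict) -> float:
    """Mean return over ONLY the episodes finished in the current iteration.

    Assumes the standard RLlib result dict layout:
    result["episodes_this_iter"] counts newly finished episodes, and
    result["hist_stats"]["episode_reward"] holds recent episode returns
    with the newest entries at the end.
    """
    n = result["episodes_this_iter"]
    if n == 0:
        return float("nan")  # no episode finished during this iteration
    rewards = result["hist_stats"]["episode_reward"]
    # Only the last n entries were produced by the current iteration.
    return float(np.mean(rewards[-n:]))


class PerIterationRewardCallback(DefaultCallbacks):
    """Hypothetical callback that logs the per-iteration mean alongside
    the built-in smoothed episode_reward_mean."""

    def on_train_result(self, *, algorithm, result, **kwargs):
        result["episode_reward_mean_per_iteration"] = (
            episode_reward_mean_per_iteration(result)
        )
```

If such a callback class were registered on the config (e.g. via config.callbacks(PerIterationRewardCallback)), the new key would show up in progress.csv next to the existing smoothed metric.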
As far as I am aware, it does not use all episodes in the trial; it only uses a configurable number of the most recent n episodes. This can be configured in the reporting options; the default is 100.
metrics_num_episodes_for_smoothing – Smooth rollout metrics over this many episodes, if possible. In case rollouts (sample collection) just started, there may be fewer than this many episodes in the buffer and we’ll compute metrics over this smaller number of available episodes. In case there are more than this many episodes collected in a single training iteration, use all of these episodes for metrics computation, meaning don’t ever cut any “excess” episodes. Set this to 1 to disable smoothing and to always report only the most recently collected episode’s return.
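For completeness, here is how that window can be set through the reporting options, assuming the Ray 2.x AlgorithmConfig API (PPO and CartPole-v1 are used purely for illustration):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    # A window of 1 disables smoothing: episode_reward_mean then reports
    # only the most recently collected episode's return.
    .reporting(metrics_num_episodes_for_smoothing=1)
)

algo = config.build()
result = algo.train()
print(result["episode_reward_mean"])
```

With a window of 1 the reported value behaves much like the per-iteration metric you describe, except that it covers a single episode rather than all episodes of the iteration.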