Change the episode metrics smoothing, as its timeframe is a bit vague?

Currently, episode metrics are logged with a window in the EnvRunner like this:

    win = config.metrics_num_episodes_for_smoothing
    logger.log_value("return_mean", ret, window=win)
    logger.log_value("return_min", ret, reduce="min", window=win)
    logger.log_value("return_max", ret, reduce="max", window=win)

Now I don’t think this is the best way to do it when the number of episodes is not known in advance; especially during evaluation, the results are off whenever win != len(episodes_seen).

Case 1: win < len(episodes_seen):
Only the sub-results episodes_seen[-win:] actually end up in the reduced value, meaning the first episodes sampled are discarded without their results contributing to min/mean/max (see the sketch below).
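
A minimal sketch of Case 1 (window size and return values are made up for illustration):

    from ray.rllib.utils.metrics.metrics_logger import MetricsLogger

    logger = MetricsLogger()
    win = 2

    # Three episodes finish in this iteration, but the window only holds two.
    for ret in [0.0, 50.0, 60.0]:
        logger.log_value("return_min", ret, reduce="min", window=win)

    # The reduced min is 50.0: the return of 0.0 from the first episode was
    # pushed out of the window and never shows up in the reported minimum.
    print(logger.reduce())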

Case 2: win > len(episodes_seen):
In that case, old episodes from evaluation_interval iterations earlier are included in the result; mean/min/max will be influenced by those old episodes, e.g.

    from ray.rllib.utils.metrics.metrics_logger import MetricsLogger

    logger = MetricsLogger()

    win = 10

    # Mediocre returns during the first eval period.
    for ret in range(10):
        logger.log_value("return_mean", ret, window=win)
        logger.log_value("return_min", ret, reduce="min", window=win)
        logger.log_value("return_max", ret, reduce="max", window=win)
    print(logger.reduce())

    # Assume some training progress, much better model now:
    for ret in range(100, 150, 10):
        logger.log_value("return_mean", ret, window=win)
        logger.log_value("return_min", ret, reduce="min", window=win)
        logger.log_value("return_max", ret, reduce="max", window=win)
    print(logger.reduce())

Here, the reported metrics after the second reduce() will be the following, which are (way) below the current performance of the model:

mean: 63.5 < 100
min: 5 <<< 100

For evaluation this can be fixed by using an evaluation_config with evaluation_duration_unit: "episodes" and metrics_num_episodes_for_smoothing set to evaluation_duration.
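
A hedged sketch of such a config (PPOConfig and the concrete numbers are just examples, and passing metrics_num_episodes_for_smoothing through AlgorithmConfig.overrides() is an assumption about the override mechanism):

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .evaluation(
            evaluation_interval=1,
            # Run exactly 20 episodes per evaluation run ...
            evaluation_duration=20,
            evaluation_duration_unit="episodes",
            # ... and make the smoothing window match, so the reported eval
            # metrics cover exactly one evaluation run.
            evaluation_config=PPOConfig.overrides(
                metrics_num_episodes_for_smoothing=20,
            ),
        )
    )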


But for training, where one rather samples by timesteps, the number of completed episodes per iteration is not constant. With a large batch size I likely complete more episodes than the window holds, so some episode returns are never included in return_mean (Case 1); with only a few completed episodes, old metrics bleed over (Case 2), in the worst case even across multiple iterations.
Possibly some smoothing over the last iterations is acceptable, and the default metrics_num_episodes_for_smoothing=100 is fine for that.

However, if I am really just interested in the current iteration, there is currently no way to get this result.
I would like to propose a metrics_num_episodes_for_smoothing="iteration" option, or a metrics_num_episodes_for_smoothing_unit: "iterations" | "episodes" setting, that uses an infinite window and clears the gathered metrics on metrics.reduce() instead of an iteration-spanning window (see the sketch below).
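
To make the proposed semantics concrete, here is a plain-Python sketch (not RLlib code) of the behavior I have in mind, using the same return values as the example above:

    class IterationWindowStat:
        """Unbounded window that is emptied on every reduce()."""

        def __init__(self):
            self.values = []

        def log(self, value):
            self.values.append(value)

        def reduce(self):
            result = {
                "mean": sum(self.values) / len(self.values),
                "min": min(self.values),
                "max": max(self.values),
            }
            self.values.clear()  # nothing bleeds into the next iteration
            return result

    stat = IterationWindowStat()

    # First iteration: mediocre returns.
    for ret in range(10):
        stat.log(ret)
    print(stat.reduce())  # {'mean': 4.5, 'min': 0, 'max': 9}

    # Second iteration: much better returns.
    for ret in range(100, 150, 10):
        stat.log(ret)
    print(stat.reduce())  # {'mean': 120.0, 'min': 100, 'max': 140}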