Change the episode metrics smoothing, as its timeframe is a bit vague?

Currently, episode metrics are logged with a window in the EnvRunner like this:

    win = config.metrics_num_episodes_for_smoothing
    logger.log_value("return_mean", ret, window=win)
    logger.log_value("return_min", ret, reduce="min", window=win)
    logger.log_value("return_max", ret, reduce="max", window=win)

Now I don’t think this is the best way to do it when the number of episodes is not known in advance; especially during evaluation, the results are off whenever win != len(episodes_seen).

Case 1: win < len(episodes_seen):
Only the sub-results episodes_seen[-win:] actually end up in the reduced value, meaning the first episodes sampled are discarded without their results contributing to min/mean/max (see the sketch below).
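
A minimal sketch of Case 1 (window size and return values are made up for illustration):

    from ray.rllib.utils.metrics.metrics_logger import MetricsLogger

    logger = MetricsLogger()
    win = 2

    # Three episodes finish in this iteration, but the window only holds two.
    for ret in [0.0, 50.0, 60.0]:
        logger.log_value("return_min", ret, reduce="min", window=win)

    # The reduced min is 50.0: the return of 0.0 from the first episode was
    # pushed out of the window and never shows up in the reported minimum.
    print(logger.reduce())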

Case 2: win > len(episodes_seen):
In that case, old episodes from evaluation_interval iterations earlier are included in the result; mean/min/max will be influenced by those old episodes, e.g.

    from ray.rllib.utils.metrics.metrics_logger import MetricsLogger

    logger = MetricsLogger()

    win = 10

    # Mediocre returns during the first eval period.
    for ret in range(10):
        logger.log_value("return_mean", ret, window=win)
        logger.log_value("return_min", ret, reduce="min", window=win)
        logger.log_value("return_max", ret, reduce="max", window=win)
    print(logger.reduce())

    # Assume some training progress, much better model now:
    for ret in range(100, 150, 10):
        logger.log_value("return_mean", ret, window=win)
        logger.log_value("return_min", ret, reduce="min", window=win)
        logger.log_value("return_max", ret, reduce="max", window=win)
    print(logger.reduce())

Here, the reported metrics after the second reduce() will be the following, which are (way) below the current performance of the model:

mean: 63.5 < 100
min: 5 <<< 100

For evaluation this can be fixed by using an evaluation_config with evaluation_duration_unit: "episodes" and metrics_num_episodes_for_smoothing set to evaluation_duration.
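
A hedged sketch of such a config (PPOConfig and the concrete numbers are just examples, and passing metrics_num_episodes_for_smoothing through AlgorithmConfig.overrides() is an assumption about the override mechanism):

    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .evaluation(
            evaluation_interval=1,
            # Run exactly 20 episodes per evaluation run ...
            evaluation_duration=20,
            evaluation_duration_unit="episodes",
            # ... and make the smoothing window match, so the reported eval
            # metrics cover exactly one evaluation run.
            evaluation_config=PPOConfig.overrides(
                metrics_num_episodes_for_smoothing=20,
            ),
        )
    )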


But for training, where one rather samples by timesteps, the number of completed episodes per iteration is not constant. With a large batch size I likely complete more episodes than the window holds, so some episode returns are never included in return_mean (Case 1); with only a few completed episodes, old metrics bleed over (Case 2), in the worst case even across multiple iterations.
Possibly some smoothing over the last iterations is acceptable, and the default metrics_num_episodes_for_smoothing=100 is fine for that.

However, if I am really just interested in the current iteration, there is currently no way to get this result.
I would like to propose a metrics_num_episodes_for_smoothing="iteration" option, or a metrics_num_episodes_for_smoothing_unit: "iterations" | "episodes" setting, that uses an infinite window and clears the gathered metrics on metrics.reduce() instead of an iteration-spanning window (see the sketch below).
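
To make the proposed semantics concrete, here is a plain-Python sketch (not RLlib code) of the behavior I have in mind, using the same return values as the example above:

    class IterationWindowStat:
        """Unbounded window that is emptied on every reduce()."""

        def __init__(self):
            self.values = []

        def log(self, value):
            self.values.append(value)

        def reduce(self):
            result = {
                "mean": sum(self.values) / len(self.values),
                "min": min(self.values),
                "max": max(self.values),
            }
            self.values.clear()  # nothing bleeds into the next iteration
            return result

    stat = IterationWindowStat()

    # First iteration: mediocre returns.
    for ret in range(10):
        stat.log(ret)
    print(stat.reduce())  # {'mean': 4.5, 'min': 0, 'max': 9}

    # Second iteration: much better returns.
    for ret in range(100, 150, 10):
        stat.log(ret)
    print(stat.reduce())  # {'mean': 120.0, 'min': 100, 'max': 140}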