What's the difference between episode_return_mean per iteration and episode_reward?

I am experimenting with a program using DQN to control traffic lights with the SUMO-RL environment. I noticed there is a difference between episode_return_mean of each iteration and episode_reward.

When I print out results['env_runners']['episode_return_mean'] for each iteration, the values are [-324.85, -249.24, -152.9275, -127.304, -127.26833, -106.62375, -100.06777, -85.0690, -78.4025, -73.9646, -76.87, -78.1725, -72.2861, -68.92, -65.8255]

But after the training process finished, I got results['env_runners']['hist_stats']['episode_reward'] with the values: [-324.85, -173.63, -103.52, -9.71, -24.81, -127.09, -52.44, -36.94, -47.62, -25.08, -10.07, -5.07, -20.71, -70.3, -121.21, -97.71, -43.54, -6.85, -8.34, -7.02]

I don’t understand the difference between episode_return_mean and episode_reward. Please explain it to me. Thanks a lot.

@Do_Giang welcome to the forum and thanks for posting this question.

May I ask which Ray version you are running your experiment on? The current state of logging is:

res["env_runners"]["hist_stats"]["episode_reward"]

stores the history of reward sums per episode and

res["env_runners"]["episode_return_mean"]

defines the average of these episode reward sums. This might have been a bit different in older versions of Ray, because there was still a hybrid stack that is deprecated and should not be used. We recommend every user switch to the new API stack (see the migration guide), as the old stack will be fully removed in the very near future.
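
To make this concrete, here is a minimal sketch in plain Python (not RLlib internals) that reproduces your last reported episode_return_mean from the per-episode returns in hist_stats. It assumes the default smoothing window of 100 episodes, which is larger than the 20 episodes collected here:

```python
# Per-episode returns as reported in
# results["env_runners"]["hist_stats"]["episode_reward"]
episode_rewards = [
    -324.85, -173.63, -103.52, -9.71, -24.81, -127.09, -52.44, -36.94,
    -47.62, -25.08, -10.07, -5.07, -20.71, -70.3, -121.21, -97.71,
    -43.54, -6.85, -8.34, -7.02,
]

# episode_return_mean is the (windowed) average over these episode sums.
# While fewer episodes than the smoothing window have completed, it is
# simply the mean over all episodes collected so far.
window = 100  # RLlib's default smoothing window (assumption)
recent = episode_rewards[-window:]
episode_return_mean = sum(recent) / len(recent)

print(episode_return_mean)  # -65.8255
```

The mean over the 20 episode returns works out to -65.8255, which matches the final episode_return_mean you printed per iteration.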

I checked the Ray version with ray --version and it returns ray, version 2.37.0. I'm going to try the new API stack that you suggested above.
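
Concretely, I plan to start from something like the minimal sketch below to opt in to the new API stack on my DQN config. The api_stack() flag names and the CartPole-v1 placeholder environment are my assumptions from the Ray 2.x docs; I'll double-check them against the migration guide:

```python
from ray.rllib.algorithms.dqn import DQNConfig

# Minimal sketch: explicitly opt in to the new API stack on a DQN config.
# The api_stack() flags below are assumptions based on recent Ray 2.x
# releases; check the migration guide for your exact version.
config = (
    DQNConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    # Placeholder env; the SUMO-RL environment would be registered separately.
    .environment("CartPole-v1")
    .env_runners(num_env_runners=1)
)

algo = config.build()
result = algo.train()

# On the new stack, per-iteration metrics live under result["env_runners"],
# e.g. the windowed mean of episode returns:
print(result["env_runners"]["episode_return_mean"])
```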

Thanks for your support.
