I am experimenting with a program that uses DQN to control traffic lights in a SUMO-RL environment. I noticed that there is a difference between the episode_return_mean of each iteration and the episode_reward values.
When I print out results['env_runners']['episode_return_mean'] after each iteration, the values are [-324.85, -249.24, -152.9275, -127.304, -127.26833, -106.62375, -100.06777, -85.0690, -78.4025, -73.9646, -76.87, -78.1725, -72.2861, -68.92, -65.8255]
But after the training process is finished, I get results['env_runners']['hist_stats']['episode_reward'] with the values: [-324.85, -173.63, -103.52, -9.71, -24.81, -127.09, -52.44, -36.94, -47.62, -25.08, -10.07, -5.07, -20.71, -70.3, -121.21, -97.71, -43.54, -6.85, -8.34, -7.02]
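For reference, here is a minimal sketch of how I read the two values. The SUMO-RL environment registration and the actual DQN hyperparameters are left out, and "my_sumo_env" is just a placeholder for my registered environment:

```python
from ray.rllib.algorithms.dqn import DQNConfig

# Minimal sketch of my training loop (SUMO-RL env registration and
# DQN hyperparameters omitted for brevity).
config = (
    DQNConfig()
    .environment("my_sumo_env")  # placeholder for my registered SUMO-RL env
)
algo = config.build()

for i in range(15):
    results = algo.train()
    # Printed after every iteration -> the first list above
    print(results["env_runners"]["episode_return_mean"])

# Printed once, after training has finished -> the second list above
print(results["env_runners"]["hist_stats"]["episode_reward"])
```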
I don't understand the difference between episode_return_mean and episode_reward. Could you please explain it? Thanks a lot.