Evaluation reward lower than training when exploration is zero

I’m training an agent with APEX using the following exploration config:

            "exploration_config": {
                "type": "EpsilonGreedy",
                "initial_epsilon": 0.2,
                "final_epsilon": 0.00,
                "warmup_timesteps": 5e4, 
                "epsilon_timesteps": 6e5,
            },
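
For reference, this is how I read that schedule: epsilon is held at the initial value during the warmup period and then annealed linearly, hitting the final value at epsilon_timesteps. A small pure-Python sketch of that reading (just my assumption about the schedule, not the RLlib implementation):

    def epsilon_at(t, initial_epsilon=0.2, final_epsilon=0.0,
                   warmup_timesteps=50_000, epsilon_timesteps=600_000):
        # My reading: hold the initial value during warmup, then anneal
        # linearly so that final_epsilon is reached at epsilon_timesteps.
        if t < warmup_timesteps:
            return initial_epsilon
        if t >= epsilon_timesteps:
            return final_epsilon
        frac = (t - warmup_timesteps) / (epsilon_timesteps - warmup_timesteps)
        return initial_epsilon + frac * (final_epsilon - initial_epsilon)

    print(epsilon_at(25_000))   # 0.2 (still in warmup)
    print(epsilon_at(600_000))  # 0.0 (fully annealed)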

At 600k steps the exploration rate reaches 0, so I would expect the evaluation reward curve to converge to the training curve. Instead, this is what I observe:

[plot: the evaluation episode reward stays consistently below the training reward curve]

I know the training curve is averaged over multiple rollout workers and is therefore smoother, but I would still expect the evaluation average to be on par with it. Instead, it sits consistently below the training curve.

My environment is fully deterministic, and the training and evaluation environments are the same.

What am I missing? Is there something in APEX that maintains some level of stochasticity even when epsilon = 0? Or is there anything else that could make the policy behave differently between training and evaluation?
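
One thing I was planning to try is pinning the evaluation workers to fully greedy behavior via the evaluation config. A sketch of what I mean (assuming the explore flag under evaluation_config is the right knob; the interval and episode counts are just placeholder values):

    config = {
        # ... rest of my APEX config ...
        "evaluation_interval": 5,        # placeholder: evaluate every 5 iterations
        "evaluation_num_episodes": 10,   # placeholder: average over 10 episodes
        "evaluation_config": {
            # Assumption: this disables exploration entirely on the
            # evaluation workers, independent of the epsilon schedule.
            "explore": False,
        },
    }

Even with that, I would like to understand why the two curves still diverge once epsilon is already 0.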