I’m training an agent with APEX using the following exploration config:
"exploration_config": {
"type": "EpsilonGreedy",
"initial_epsilon": 0.2,
"final_epsilon": 0.00,
"warmup_timesteps": 5e4,
"epsilon_timesteps": 6e5,
},
Once epsilon has decayed to 0 (around 600k steps), I would expect the evaluation reward curve to converge to the training curve. I know the training curve is an average over multiple rollout workers and therefore smoother, but I would still expect the average evaluation reward to be on par with it. Instead, it stays consistently below the training reward.
My environment does not have any stochasticity, and the training and evaluation environments are the same.
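
To be explicit, this is the kind of greedy evaluation I have in mind (just a sketch; I'm assuming the standard Trainer.compute_action API with its explore flag and the usual gym reset/step loop):

def evaluate_greedy(trainer, env, num_episodes=10):
    """Average return over num_episodes using the greedy (argmax) policy."""
    returns = []
    for _ in range(num_episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            # explore=False should disable epsilon-greedy sampling entirely.
            action = trainer.compute_action(obs, explore=False)
            obs, reward, done, _ = env.step(action)
            ep_return += reward
        returns.append(ep_return)
    return sum(returns) / len(returns)

With epsilon at 0 the training workers should also be acting greedily, so I would expect RLlib's built-in evaluation to report essentially the same numbers as a loop like this.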
What am I missing? Is there something in APEX that maintains some level of stochasticity even when epsilon = 0? Or is there anything else that could make the behavior differ between training and evaluation?