I have trouble understanding the following important note (it’s a comment in the RLlib Trainer config section “Evaluation Settings”):
IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal policy, even if this is a stochastic one. Setting “explore=False” here will result in the evaluation workers not using this optimal policy!
Does this mean that if I want to deploy a learned (optimal) policy, I would have to set
"explore=True" and not, as I’d expected, "explore=False"?
Perhaps I lack knowledge about what a stochastic (optimal) policy really is.
So far, I’ve thought that
"explore=False" means computing deterministic (optimal) actions from the learned policy (i.e. “max action logit”) and that
"explore=True" means computing stochastic actions from the learned policy (i.e. “sample from action logits”).
Can anyone shed some light on this?