Deploying a learned policy under "explore=False / True"

Hey folks,
I'm having trouble understanding the following important note (it's a comment in RLlib's Trainer config section "Evaluation Settings"):

IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal policy, even if this is a stochastic one. Setting “explore=False” here will result in the evaluation workers not using this optimal policy!
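
For context, this is roughly where the setting sits in the Trainer config (a minimal sketch; the surrounding keys are just my example, the comment is the one quoted above):

```python
config = {
    # ... other Trainer settings ...
    "evaluation_num_workers": 1,
    # Override the policy's behavior on the evaluation workers only:
    "evaluation_config": {
        "explore": False,  # <- the setting the note warns about
    },
}
```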

Does this mean that if I want to deploy a learned (optimal) policy, I would have to set "explore=True" and not, as I'd expected, "explore=False"?!

Perhaps I lack an understanding of what a stochastic (optimal) policy really is :thinking:
So far, I've assumed that "explore=False" means computing deterministic (optimal) actions from the learned policy (i.e. taking the action with the max logit) and that "explore=True" means sampling stochastic actions from the learned policy's action distribution. Here's a small sketch of my mental model (see below).
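
This is plain PyTorch, not actual RLlib internals, just to show what I mean by the two modes:

```python
import torch

# Action logits as they might come out of a policy network (made-up values):
logits = torch.tensor([2.0, 0.5, -1.0])
dist = torch.distributions.Categorical(logits=logits)

deterministic_action = torch.argmax(logits)  # what I think "explore=False" does
stochastic_action = dist.sample()            # what I think "explore=True" does
```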
Can anyone shed some light on this?
