Hey folks,

I’m having trouble understanding the following important note (it’s a comment in RLlib’s Trainer config section “Evaluation Settings”):

IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal policy, even if this is a stochastic one. Setting “explore=False” here will result in the evaluation workers not using this optimal policy!
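For context, this is roughly where that flag sits in the (old-style) Trainer config dict I’m looking at. The surrounding keys and values are just from my own setup, for illustration:

```python
# Sketch of an RLlib Trainer config dict; values here are only
# illustrative, not recommendations.
config = {
    "num_workers": 1,
    # --- Evaluation Settings ---
    "evaluation_interval": 5,        # evaluate every 5 training iterations
    "evaluation_num_episodes": 10,
    "evaluation_config": {
        # This is the flag the IMPORTANT NOTE is warning about:
        "explore": False,
    },
}
```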

Does this mean that if I want to deploy a learned (optimal) policy, I would have to set `"explore=True"` and not, as I’d expected, `"explore=False"`?

Perhaps I lack knowledge of what a stochastic (optimal) policy really is.

So far, I’ve thought that `"explore=False"` means computing deterministic (optimal) actions from the learned policy (i.e. “max action logit”), and that `"explore=True"` means computing stochastic actions from the learned policy (i.e. “sample from action logits”).
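In other words, my mental model of the two settings is something like this (a plain-NumPy sketch of my understanding, not RLlib’s actual implementation; the function names and logits are made up):

```python
import numpy as np

# Hypothetical action logits from a learned policy over 3 discrete actions.
logits = np.array([2.0, 0.5, -1.0])

def deterministic_action(logits):
    # My reading of "explore=False": greedily pick the action
    # with the highest logit.
    return int(np.argmax(logits))

def stochastic_action(logits, rng):
    # My reading of "explore=True": sample an action from the
    # softmax distribution over the logits.
    shifted = logits - logits.max()          # for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
print(deterministic_action(logits))   # always the max-logit action (0 here)
print(stochastic_action(logits, rng)) # usually 0, but sometimes 1 or 2
```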

Can anyone shed some light on this?