Deploying a learned policy under "explore=False / True"

Hey folks,
I have trouble understanding the following important note (it’s a comment in RLlib’s Trainer config section “Evaluation Settings”):

IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal policy, even if this is a stochastic one. Setting “explore=False” here will result in the evaluation workers not using this optimal policy!

Does this mean that if I want to deploy a learned (optimal) policy, I would have to set "explore=True" and not, as I’d expected, "explore=False"?!

Perhaps, I lack knowledge about what a stochastic (optimal) policy really is :thinking:
So far, I’ve thought that "explore=False" means computing deterministic (optimal) actions from the learned policy (i.e. taking the action with the max logit) and that "explore=True" means computing stochastic actions from the learned policy (i.e. sampling from the action distribution given by the logits).
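To illustrate what I mean, here is a rough sketch with made-up logits (not RLlib’s actual implementation):

```python
import numpy as np

# Hypothetical action logits produced by a learned policy for one observation.
logits = np.array([1.2, 0.3, -0.5])

# What I understand as explore=False: pick the single most likely action ("max action logit").
deterministic_action = int(np.argmax(logits))

# What I understand as explore=True: sample from the softmax distribution over the logits.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
stochastic_action = int(np.random.choice(len(logits), p=probs))
```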
Can anyone shed some light on this?

Same problem here, did you get any results?

@hossein836 I don’t know any more than I did 3 months ago when I opened this topic, sorry :person_shrugging:
I guess it’s like the comment says: if your optimal policy is a stochastic one (i.e. you can either go left or right to reach the optimum, not only left), then you should also enable action sampling in the deployment phase. If you do well with "explore=False", then your policy is probably a deterministic one.

Sorry for this rather unhelpful answer, but I don’t know any better. What are your experiences/observations so far? What’s your use case?

Hi @klausk55, hi @hossein836 ,

I guess it’s like the comment says: if your optimal policy is a stochastic one (i.e. you can either go left or right to reach the optimum, not only left), then you should also enable action sampling in the deployment phase.

This is correct, and the comment in RLlib’s Trainer config seems counterintuitive, but that’s how it is. That’s why you will see explore=False in the Q-learning evaluation configs and explore=True in the PG ones. A stochastic policy can be optimal, and if it is optimal, you degrade performance by changing it (for example by making it deterministic).
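In config terms, that’s roughly this pattern (a minimal sketch using the old dict-style Trainer config; only the "explore" key matters here, the other values are just illustrative):

```python
# Q-learning targets a greedy (deterministic) policy, so evaluating
# without exploration is fine.
dqn_config = {
    "evaluation_interval": 10,
    "evaluation_config": {"explore": False},
}

# A policy-gradient method may converge to a stochastic optimal policy,
# so keep sampling actions during evaluation.
ppo_config = {
    "evaluation_interval": 10,
    "evaluation_config": {"explore": True},
}
```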

Hi @arturn,

Thanks for your additional explanations!
You mention that for the PG algos the default evaluation config is explore=False. Does this mean that policies learned by PG algos aren’t stochastic? :thinking:

Hi @klausk55 ,

Sorry, I flipped True and False there. It’s exactly the other way around. I changed my answer!

Hi @arturn,

are there any indicators that suggest whether an optimal policy will be stochastic or deterministic?
Or is it simply trial-and-error, i.e. deploy the learned policy with explore=False and explore=True and see if it makes any difference in performance?

The indicator that a stochastic policy could be replaced by a deterministic one is that the uncertainty of your model is consistently extremely low. That is, the action distribution from which you sample is extremely narrow.

To put it more concretely: a common action distribution for a single independent action is a one-dimensional normal distribution. Your model outputs a mu and a sigma for you to sample from: action = action_mu + action_sigma * np.random.standard_normal().
If your model is extremely sure about an action, it will output an action_sigma close to zero and an action_mu that you could choose deterministically, ending up with almost the same policy.
You can also run trials and observe, but the difference between an extremely “sure” stochastic policy and a deterministic one might be very subtle and only show up on very specific problems.
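A minimal numeric sketch of that (made-up mu/sigma values):

```python
import numpy as np

# Hypothetical policy outputs for one observation.
action_mu = 0.7      # mean of the action distribution
action_sigma = 0.01  # very small sigma -> the policy is almost deterministic

# explore=True: sample from the normal distribution.
stochastic_action = action_mu + action_sigma * np.random.standard_normal()

# explore=False: take the deterministic "center" of the distribution.
deterministic_action = action_mu

# With a sigma this small, both actions are nearly identical, so making the
# policy deterministic barely changes its behaviour.
```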

The default is therefore to let stochastic policies be stochastic during evaluation, so they perform well both on problems with an optimal stochastic policy and on problems with an optimal deterministic policy, without you having to change anything.

@arturn is right. I also want to mention that keeping exploration stochastic implicitly makes the agent explore more. There are some articles that compare exploration settings; from what I’ve read, I recommend exploring more at the beginning and reducing exploration over time. To do this, you can use the entropy coefficient and reduce it with a schedule, or implement custom exploration by subclassing the exploration class.
You can also use ideas like adaptive clipping, e.g. “Decaying Clipping Range in Proximal Policy Optimization” (IEEE Xplore).
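As a rough sketch of the entropy-coefficient schedule idea (old dict-style PPO config; the timesteps and coefficients are made up and would need tuning for your problem):

```python
ppo_config = {
    "entropy_coeff": 0.01,
    # Anneal the entropy bonus from 0.01 down to 0.0 over the first 1M timesteps,
    # i.e. explore a lot early on and less as training progresses.
    "entropy_coeff_schedule": [
        [0, 0.01],
        [1_000_000, 0.0],
    ],
}
```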

Hi,

Just wanted to add that things look different if you use entropy regularization. This will yield a policy that is optimized for a target that represents not only the optimal behaviour but also incorporates the entropy loss.
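For reference, the entropy-regularized objective roughly looks like this (standard formulation, not RLlib-specific notation; $\beta$ is the entropy coefficient):

$$
J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\textstyle\sum_t r_t\Big] + \beta\, \mathbb{E}_{s}\big[\mathcal{H}\big(\pi_\theta(\cdot \mid s)\big)\big]
$$

so the optimum trades off return against policy entropy, which generally keeps the learned policy stochastic.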