Deploying a learned policy under "explore=False / True"

Hey folks,
I have trouble understanding the following important note (it’s a comment in RLlib’s Trainer config section “Evaluation Settings”):

IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal policy, even if this is a stochastic one. Setting “explore=False” here will result in the evaluation workers not using this optimal policy!

Does this mean that if I want to deploy a learned (optimal) policy, I would have to set "explore=True" and not, as I’d expected, "explore=False"?!

Perhaps, I lack knowledge about what a stochastic (optimal) policy really is :thinking:
So far, I’ve thought that "explore=False" means computing deterministic (optimal) actions from the learned policy (i.e. taking the action with the max logit) and that "explore=True" means computing stochastic actions from the learned policy (i.e. sampling from the action distribution given by the logits).
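To illustrate what I mean, here is a rough sketch with made-up logits (not RLlib’s actual implementation):

```python
import numpy as np

# Hypothetical action logits produced by a learned policy for one observation.
logits = np.array([1.2, 0.3, -0.5])

# What I understand as explore=False: pick the single most likely action ("max action logit").
deterministic_action = int(np.argmax(logits))

# What I understand as explore=True: sample from the softmax distribution over the logits.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
stochastic_action = int(np.random.choice(len(logits), p=probs))
```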
Can anyone shed some light on this?

Same problem here, did you get any results?

@hossein836 I don’t know any more than I did 3 months ago when I opened this topic, sorry :person_shrugging:
I guess it’s like the comment says: if your optimal policy is a stochastic one (i.e. you can either go left or right to reach the optimum, not only left), then you should also enable action sampling in the deployment phase. If you do well with "explore=False", then your policy is probably a deterministic one.

Sorry for this rather unhelpful answer, but I don’t know any better. What are your experiences/observations so far? What’s your use case?

Hi @klausk55, hi @hossein836 ,

I guess it’s like the comment says: if your optimal policy is a stochastic one (i.e. you can either go left or right to reach the optimum, not only left), then you should also enable action sampling in the deployment phase.

This is correct, and the comment in RLlib’s Trainer config seems counterintuitive, but that’s how it is. That’s why you will see explore=False in the Q-learning evaluation configs and explore=True in the PG ones. A stochastic policy can be optimal, and if it is optimal, you degrade performance by changing it (for example by making it deterministic).
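In config terms, that’s roughly this pattern (a minimal sketch using the old dict-style Trainer config; only the "explore" key matters here, the other values are just illustrative):

```python
# Q-learning targets a greedy (deterministic) policy, so evaluating
# without exploration is fine.
dqn_config = {
    "evaluation_interval": 10,
    "evaluation_config": {"explore": False},
}

# A policy-gradient method may converge to a stochastic optimal policy,
# so keep sampling actions during evaluation.
ppo_config = {
    "evaluation_interval": 10,
    "evaluation_config": {"explore": True},
}
```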

Hi @arturn,

Thanks for your additional explanations!
You mention that for the PG algos the default evaluation config is explore=False. Does this mean that policies learned by PG algos aren’t stochastic? :thinking:

Hi @klausk55 ,

Sorry, I flipped True and False there. It’s exactly the other way around. I changed my answer!

Hi @arturn,

are there any indicators that suggest whether an optimal policy will be stochastic or deterministic?
Or is it simply trial-and-error, i.e. deploy the learned policy with explore=False and explore=True and see if it makes any difference in performance?

The indicator that a stochastic policy could be replaced by a deterministic one is that the uncertainty of your model is consistently extremely low. That is, the action distribution from which you sample is extremely narrow.

To put it more concretely: a common action distribution for a single independent action is a one-dimensional normal distribution. Your model outputs a mu and a sigma for you to sample from: action = action_mu + action_sigma * np.random.standard_normal().
If your model is extremely sure about an action, it will output an action_sigma close to zero and an action_mu that you could choose deterministically, ending up with almost the same policy.
You can also run trials and observe, but the difference between an extremely “sure” stochastic policy and a deterministic one might be very subtle and only show up on very specific problems.
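A minimal numeric sketch of that (made-up mu/sigma values):

```python
import numpy as np

# Hypothetical policy outputs for one observation.
action_mu = 0.7      # mean of the action distribution
action_sigma = 0.01  # very small sigma -> the policy is almost deterministic

# explore=True: sample from the normal distribution.
stochastic_action = action_mu + action_sigma * np.random.standard_normal()

# explore=False: take the deterministic "center" of the distribution.
deterministic_action = action_mu

# With a sigma this small, both actions are nearly identical, so making the
# policy deterministic barely changes its behaviour.
```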

The default is therefore to let stochastic policies be stochastic during evaluation, so they perform well both on problems with an optimal stochastic policy and on problems with an optimal deterministic policy, without you having to change anything.

@arturn is right. I also want to mention that keeping exploration stochastic implicitly makes the agent explore more. There are some articles that compare exploration settings; from what I’ve read, I recommend exploring more at the beginning and reducing exploration over time. To do this, you can use the entropy coefficient and reduce it with a schedule, or implement custom exploration by subclassing the exploration class.
You can also use ideas like adaptive clipping, e.g. “Decaying Clipping Range in Proximal Policy Optimization” (IEEE Xplore).
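As a rough sketch of the entropy-coefficient schedule idea (old dict-style PPO config; the timesteps and coefficients are made up and would need tuning for your problem):

```python
ppo_config = {
    "entropy_coeff": 0.01,
    # Anneal the entropy bonus from 0.01 down to 0.0 over the first 1M timesteps,
    # i.e. explore a lot early on and less as training progresses.
    "entropy_coeff_schedule": [
        [0, 0.01],
        [1_000_000, 0.0],
    ],
}
```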

Hi,

Just wanted to add that things look different if you use entropy regularization. This will yield a policy that is optimized for a target that represents not only the optimal behaviour but also incorporates the entropy loss.
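For reference, the entropy-regularized objective roughly looks like this (standard formulation, not RLlib-specific notation; $\beta$ is the entropy coefficient):

$$
J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\textstyle\sum_t r_t\Big] + \beta\, \mathbb{E}_{s}\big[\mathcal{H}\big(\pi_\theta(\cdot \mid s)\big)\big]
$$

so the optimum trades off return against policy entropy, which generally keeps the learned policy stochastic.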