Strategy behind setting values of logp


I have a question about how logp values are set during exploration. In the case of StochasticSampling exploration, if the explore flag is set to True, the logp value is computed from the sampled action and the action distribution, which seems correct. In the case of no exploration (explore == False), logp is set to 0, which corresponds to probability 1; this is also correct (with the current deterministic policy we are certain to select this action).
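To make the described behavior concrete, here is a minimal sketch (not RLlib's actual code) of the StochasticSampling logic above, assuming a diagonal-Gaussian action distribution; the function name and signature are hypothetical:

```python
import torch
from torch.distributions import Normal

def get_action_and_logp(mean, std, explore):
    # Hypothetical sketch of StochasticSampling's explore/no-explore branches.
    dist = Normal(mean, std)
    if explore:
        action = dist.sample()                    # sample stochastically
        logp = dist.log_prob(action).sum(dim=-1)  # logp of the sampled action
    else:
        action = mean                             # deterministic action
        logp = torch.zeros(mean.shape[:-1])       # logp = 0 -> probability 1
    return action, logp
```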

However, what is not clear to me is why, when we use some other exploration that adds something to the action (like GaussianNoise), we also set logp to 0. Would it not make more sense to compute how well the exploration action fits the current action distribution?
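The suggestion could be sketched as follows. This is a hypothetical illustration, not RLlib's implementation: the function name, noise scale, and use of a diagonal Gaussian are all assumptions.

```python
import torch
from torch.distributions import Normal

def noisy_action_logp(dist_mean, dist_std, noise_std=0.1):
    # Hypothetical: score a GaussianNoise-style exploration action under
    # the current action distribution instead of hard-coding logp = 0.
    dist = Normal(dist_mean, dist_std)            # current action distribution
    noise = torch.randn_like(dist_mean) * noise_std
    action = dist_mean + noise                    # deterministic action + noise
    logp = dist.log_prob(action).sum(dim=-1)      # how well the noisy action fits
    return action, logp
```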

Thanks in advance for any answers!


Hey @Mateusz_Orlowski , thanks for the question!
I think this was done for simplicity. Algorithms that typically use GaussianNoise (e.g. TD3) don't use logp in their loss calculations.
But you are absolutely right, these are not the correct values. Please feel free to fix this and open a PR. Happy to change this.