Strategy behind setting values of logp


I have a question about how logp values are set during exploration. In the case of StochasticSampling exploration, if the explore flag is set to True, the logp value is computed from the sampled action and the action distribution, which seems correct. In the case of no exploration (explore == False), logp is set to 0, which corresponds to probability 1; this is also correct (with the current deterministic policy we are certain to select this action).
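To make the described behavior concrete, here is a minimal sketch (not RLlib's actual code) of the StochasticSampling logic above, assuming a diagonal-Gaussian action distribution; the function name and signature are hypothetical:

```python
import torch
from torch.distributions import Normal

def get_action_and_logp(mean, std, explore):
    # Hypothetical sketch of StochasticSampling's explore/no-explore branches.
    dist = Normal(mean, std)
    if explore:
        action = dist.sample()                    # sample stochastically
        logp = dist.log_prob(action).sum(dim=-1)  # logp of the sampled action
    else:
        action = mean                             # deterministic action
        logp = torch.zeros(mean.shape[:-1])       # logp = 0 -> probability 1
    return action, logp
```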

However, what is not clear to me is why, when we use some other exploration that adds something to the action (like GaussianNoise), we also set logp to 0. Would it not make more sense to compute how well the exploration action fits the current action distribution?
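The suggestion could be sketched as follows. This is a hypothetical illustration, not RLlib's implementation: the function name, noise scale, and use of a diagonal Gaussian are all assumptions.

```python
import torch
from torch.distributions import Normal

def noisy_action_logp(dist_mean, dist_std, noise_std=0.1):
    # Hypothetical: score a GaussianNoise-style exploration action under
    # the current action distribution instead of hard-coding logp = 0.
    dist = Normal(dist_mean, dist_std)            # current action distribution
    noise = torch.randn_like(dist_mean) * noise_std
    action = dist_mean + noise                    # deterministic action + noise
    logp = dist.log_prob(action).sum(dim=-1)      # how well the noisy action fits
    return action, logp
```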

Thanks in advance for any answers!


Hey @Mateusz_Orlowski , thanks for the question!
I think this was done for simplicity. Algorithms that typically use GaussianNoise (e.g. TD3) don't use logp in their loss calculations.
But you are absolutely right, these are not the correct values. Please feel free to fix this and open a PR. Happy to change this.