How does StochasticSampling work?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi,

I am trying to understand how StochasticSampling works during training and evaluation.

From this and this post I understand that it is supposed to take the model output (I use random_timesteps=0), add some kind of noise, and then sample an action from it. However, I am very confused about what actually happens.
My questions:

  1. Where in the code is this noise added?
  2. Are there any properties to this noise?
  3. In the StochasticSampling class documentation it says: "Also allows for scheduled parameters for the distributions, such as lowering stddev, temperature, etc. over time." The only example I found was SoftQ, which overrides get_exploration_action to add the temperature. Is there another example where stddev (is that the std of the noise?) is used?
  4. In evaluation - if I use "explore": True in order to keep the policy stochastic, are the actions generated from the policy output alone (without argmax), or does StochasticSampling also affect their generation in this mode? (See the config sketch after this list for the kind of setup I mean.)
  5. In PPO - does the entropy term added to the loss "add on top of" the exploration mechanism (in particular StochasticSampling) that is used? I.e., are there more randomly generated actions in this case because the agent has two sources that contribute to exploration?
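
For context, this is roughly the setup I am asking about (a minimal sketch; the keys come from the RLlib exploration docs, and the env name is just a placeholder):

```python
config = {
    "env": "Pendulum-v1",              # placeholder environment
    "explore": True,                   # keep the policy stochastic
    "exploration_config": {
        "type": "StochasticSampling",
        "random_timesteps": 0,         # no purely random warm-up steps
    },
    # evaluation workers can override the top-level "explore" setting:
    "evaluation_config": {
        "explore": True,
    },
}
```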

Thank you!

Hi there! :wave:t3:

Would you like to ask your question in RLlib Office Hours? It sounds like a good topic!

:writing_hand:t3: Just add the discuss link to your question to this doc: RLlib Office Hours - Google Docs

Thanks! Hope to see you there!

Are these Office Hours recorded? I would be very interested in knowing the answers to these questions.

I couldn’t make it to the last office hours to ask there, so you won’t find the answers in the recordings (they do record it; the link is in the Google Doc @christy shared).

I still hope to get some help here.

Hi @carlorop ,

  1. Also in reference to your other post: StochasticSampling will, if used in an algorithm, be called in the policies, like here.
  2. The distribution of the actions (and therefore of the noise, if you will) is parameterized by the outputs of your model. For example, for a diagonal Gaussian of size l, 2*l model outputs are needed to parameterize this distribution (see the first sketch after this list).
  3. The parameters of an exploratory action sampling step depend on the distribution used. Stddev is one of them.
  4. A policy includes the stochastic sampling step in its compute_action methods. Therefore, choosing explore=True will lead to the output of an exploratory action (see the second sketch after this list).
  5. The entropy is calculated based on the distribution parameters output by your model. This way, the entropy loss can efficiently control the variance.
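
To make answers 2 and 5 a bit more concrete, here is a minimal sketch (plain NumPy, not RLlib's actual code) of how 2*l model outputs can parameterize a diagonal Gaussian, how an exploratory action is sampled from it, and where the entropy comes from:

```python
import numpy as np

l = 3                                  # action dimension (assumed)
model_out = np.random.randn(2 * l)     # stand-in for your model's output layer

mean, log_std = model_out[:l], model_out[l:]   # first half: means, second half: log-stddevs
std = np.exp(log_std)

# explore=True: sample from the distribution (this is the stochastic part).
noise = np.random.randn(l)
action_explore = mean + std * noise

# explore=False: take the deterministic output of the distribution (its mean).
action_deterministic = mean

# Entropy of a diagonal Gaussian, computed from the same parameters --
# this is what PPO's entropy coefficient acts on (cf. answer 5).
entropy = 0.5 * l * (1.0 + np.log(2.0 * np.pi)) + log_std.sum()
```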
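And for answer 4, a small sketch of the explore flag on the trainer API (assuming a PPOTrainer and a Gym env; adapt the names to your Ray/Gym versions):

```python
import gym
from ray.rllib.agents.ppo import PPOTrainer

trainer = PPOTrainer(config={"env": "Pendulum-v1", "framework": "torch"})
obs = gym.make("Pendulum-v1").reset()

# StochasticSampling draws an action from the model's action distribution:
a_stochastic = trainer.compute_single_action(obs, explore=True)

# With explore=False the distribution's deterministic output (e.g. the mean) is returned:
a_deterministic = trainer.compute_single_action(obs, explore=False)
```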