How does StochasticSampling work?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.


I am trying to understand how StochasticSampling works during training and evaluation.

From this and this post I understand that it is supposed to take the model output (I use random_timesteps=0), add some kind of noise, and then sample an action from it. However, I am very confused about what actually happens.
My questions:

  1. Where in the code is this noise added?
  2. Are there any properties to this noise?
  3. The StochasticSampling class documentation says: “Also allows for scheduled parameters for the distributions, such as lowering stddev, temperature, etc. over time.” The only example I found was SoftQ, which overrides get_exploration_action to add the temperature. Is there another example where stddev (is that the std of the noise?) is used?
  4. In evaluation: if I use "explore": True in order to keep the policy stochastic, are the actions generated from the policy output alone (without argmax), or does StochasticSampling also affect their generation in this mode?
  5. In PPO: does the entropy term added to the loss “add on top of” the exploration mechanism (particularly StochasticSampling) that is used? I.e., are the actions more random in this case because the agent has two sources that contribute to exploration?

Thank you!

Hi there! :wave:t3:

Would you like to ask your question in RLlib Office Hours? It sounds like a good topic!

:writing_hand:t3: Just add the Discuss link to your question to this doc: RLlib Office Hours - Google Docs

Thanks! Hope to see you there!

Are these Office Hours recorded? I would be very interested in knowing the answers to these questions.

I couldn’t make it to the last office hours to ask there, so you won’t find the answers in the recordings (they do record them; the link is in the Google Doc @christy shared).

I still hope to get some help here.

Hi @carlorop ,

  1. Also in reference to your other post: StochasticSampling, if used in an algorithm, will be called in the policies, like here.
  2. The distribution of the actions (and therefore of the noise, if you will) is parameterized by the outputs of your model. For example, for a diagonal Gaussian of size l, 2*l model outputs are needed to parameterize the distribution (l means and l log-stddevs).
  3. The parameters of an exploratory action-sampling step depend on the distribution used; stddev is one of them.
  4. A policy includes the stochastic sampling step in its compute_actions methods. Therefore, choosing explore=True will lead to the output of an exploratory action.
  5. The entropy is calculated from the distribution parameters output by your model. This way, the entropy loss can directly control the variance.
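
To illustrate points 1, 2, and 4: here is a minimal NumPy sketch (not RLlib’s actual code) of how a StochasticSampling-style step draws an action from a diagonal Gaussian parameterized by the model output. The split of the model output into mean and log-stddev halves is an assumption made for illustration:

```python
import numpy as np

def sample_action(model_out, explore=True, rng=None):
    """Sketch of a stochastic sampling step for a diagonal Gaussian.

    model_out: array of size 2*l -- assumed here to be l means followed
    by l log standard deviations, as output by the policy model.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    l = model_out.shape[0] // 2
    mean, log_std = model_out[:l], model_out[l:]
    if not explore:
        # Deterministic "sample": the distribution's mode (the mean).
        return mean
    # Exploration: the "noise" is the Gaussian sampling itself --
    # no separate noise term is added on top of the distribution.
    return mean + np.exp(log_std) * rng.standard_normal(l)

model_out = np.array([0.5, -0.5, np.log(0.1), np.log(0.1)])
deterministic = sample_action(model_out, explore=False)   # just the mean
stochastic = sample_action(model_out, explore=True)       # mean + std * N(0, 1)
```

Note that with explore=True the action is still generated from the policy output alone: the stochasticity comes entirely from sampling the model-parameterized distribution, not from extra noise injected afterwards.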
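
On point 5: for a diagonal Gaussian, the entropy depends only on the log-stddev outputs, which is why an entropy bonus in PPO’s loss directly pushes the model toward a wider (more exploratory) action distribution. A short sketch of that relationship, using the standard closed-form Gaussian entropy:

```python
import numpy as np

def diag_gaussian_entropy(log_std):
    """Entropy of an l-dim diagonal Gaussian: per dimension it is
    0.5 * (1 + log(2*pi)) + log_std. It grows with the stddev and is
    independent of the mean, so an entropy bonus rewards wider
    distributions without constraining the action's location."""
    return np.sum(0.5 * (1.0 + np.log(2.0 * np.pi)) + log_std)

narrow = diag_gaussian_entropy(np.log([0.1, 0.1]))
wide = diag_gaussian_entropy(np.log([1.0, 1.0]))
# The entropy bonus favors the wider distribution: wide > narrow.
```

So the entropy term and StochasticSampling are not two independent noise sources: the sampling step draws from the model’s distribution, and the entropy bonus shapes that same distribution during training.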