How severely does this issue affect your experience of using Ray?
- Medium: It contributes significant difficulty to completing my task, but I can work around it.
I am trying to understand how `StochasticSampling` works during training and evaluation.
From this and this post, I understand that it is supposed to take the model output (I use `random_timesteps=0`), add some kind of noise, and then sample an action from it. However, I am very confused about what actually happens.
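To make my mental model concrete, here is a minimal sketch of what I currently think happens (assuming a discrete action space; `sample_action` is my own illustrative helper, not RLlib code):

```python
import torch
from torch.distributions import Categorical

def sample_action(logits: torch.Tensor, explore: bool) -> torch.Tensor:
    """Sketch of stochastic vs. deterministic action selection."""
    if explore:
        # Stochastic: sample from the distribution parameterized by the
        # model output. The only "noise" here is the randomness of the
        # sampling step itself -- nothing is added to the logits.
        return Categorical(logits=logits).sample()
    # Deterministic: take the mode (argmax) of the distribution.
    return torch.argmax(logits, dim=-1)

logits = torch.tensor([[2.0, 0.5, -1.0]])    # model output for one observation
print(sample_action(logits, explore=True))   # varies from run to run
print(sample_action(logits, explore=False))  # always tensor([0])
```

With that picture in mind, my questions: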
- Where in the code is this noise added?
- What are the properties of this noise (e.g., its distribution and scale)?
- In the `StochasticSampling` class documentation there is:
  > Also allows for scheduled parameters for the distributions, such as lowering stddev, temperature, etc. over time.

  The only example I found was `SoftQ`, which overrides `get_exploration_action` to add the temperature. Is there another example where stddev (is that the std of the noise?) is used? (For what I imagine such a schedule doing, see the first sketch after this list.)
- In evaluation: if I use `"explore": True` in order to keep the policy stochastic, are the actions sampled from the policy output alone (without argmax), or does `StochasticSampling` also affect their generation in this mode?
- In PPO: does the entropy term added to the loss "add on top of" the exploration mechanism (particularly `StochasticSampling`)? I.e., are the generated actions more random in this case because the agent has two sources contributing to exploration? (See the second sketch below for how I picture this.)
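For the scheduled-parameters question, this is the kind of mechanism I imagine the documentation means (my own sketch, not RLlib code; the linear schedule and its constants are made up for illustration):

```python
import torch
from torch.distributions import Categorical

def temperature(t: int, t_final: int = 100_000,
                start: float = 1.0, end: float = 0.1) -> float:
    """Linearly decay the temperature from `start` to `end` over `t_final` steps."""
    frac = min(t / t_final, 1.0)
    return start + frac * (end - start)

def sample_with_temperature(logits: torch.Tensor, t: int) -> torch.Tensor:
    # Dividing logits by a temperature < 1 sharpens the distribution,
    # so the policy becomes greedier (explores less) as t grows.
    return Categorical(logits=logits / temperature(t)).sample()

logits = torch.tensor([[2.0, 0.5, -1.0]])
print(sample_with_temperature(logits, t=0))        # samples from the unscaled distribution
print(sample_with_temperature(logits, t=100_000))  # sharper, near-greedy sampling
```

Is this roughly what a scheduled temperature/stddev would do inside `StochasticSampling`?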
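And for the PPO question, this is how I picture the two mechanisms relating (a generic PPO-style loss sketch with made-up coefficient values, not RLlib's actual implementation):

```python
import torch

def ppo_total_loss(policy_loss: torch.Tensor,
                   value_loss: torch.Tensor,
                   entropy: torch.Tensor,
                   vf_coeff: float = 1.0,
                   entropy_coeff: float = 0.01) -> torch.Tensor:
    # Subtracting the entropy term rewards broader action distributions.
    # StochasticSampling then samples from whatever distribution results,
    # so the two mechanisms would compound: the entropy bonus widens the
    # distribution, and the sampling step realizes that extra randomness.
    return policy_loss + vf_coeff * value_loss - entropy_coeff * entropy

print(ppo_total_loss(torch.tensor(0.5), torch.tensor(0.2), torch.tensor(1.1)))
```

Is that the right way to think about it, or does the entropy term interact with exploration in some other way?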