How severely does this issue affect your experience of using Ray?
Medium: It causes significant difficulty in completing my task, but I can work around it.
Hello, does the action distribution class used by PPO for a discrete action space default to class Categorical(TFActionDistribution)? Would the output of this be probabilities, or just the raw logit values? Shouldn’t the policy outputs be stochastic in some way?
I’m confused about my policy’s output. Please advise.
For clarification: is there a softmax in the last layer by default, or does PPO use the raw values? That is what the question boils down to, I think.
There is no softmax in the model computed by the policy. It outputs logits.
During rollouts, in the sample phase of training, those logits are passed to an action distribution appropriate for the type of action space, which generates the actions passed to the environment. For a Discrete action space this is a categorical distribution, which applies a softmax.
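As a minimal sketch of that sampling step (plain PyTorch rather than RLlib's actual classes, and with made-up logit values):

```python
import torch
from torch.distributions import Categorical

# Pretend the policy model produced these raw logits for a Discrete(4) action
# space and a batch of two observations (illustrative values, not real model output).
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1],
                       [0.3, 0.3, 0.3, 0.3]])

# Constructing a categorical distribution from logits applies the softmax internally.
dist = Categorical(logits=logits)
print(dist.probs)              # normalized probabilities, each row sums to 1

# Actions are sampled stochastically from those probabilities, which is where
# the stochasticity of the policy comes from.
actions = dist.sample()
print(actions)                 # e.g. tensor([0, 2])
print(dist.log_prob(actions))  # logp of the sampled actions, kept for the PPO loss
```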
During training, the logits parameterize an action distribution that is then used to compute the surrogate_loss and the kl_loss. The first uses the logp of the current model on the observed actions. The second calculates the KL divergence between the action distribution parameterized by the model during rollouts and the one parameterized by the current model.
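Roughly, the two terms look like this (again a plain PyTorch sketch with invented numbers, not RLlib's literal loss code; the clip and KL coefficients are illustrative):

```python
import torch
from torch.distributions import Categorical, kl_divergence

# Logits saved at rollout time vs. logits from the model currently being trained.
behaviour_logits = torch.tensor([[2.0, 0.5, -1.0, 0.1]])  # sampling-time policy
curr_logits      = torch.tensor([[1.8, 0.7, -0.9, 0.2]])  # current model
actions          = torch.tensor([0])                       # actions observed in the rollout
advantages       = torch.tensor([1.5])                     # e.g. from GAE, fixed here

old_dist  = Categorical(logits=behaviour_logits)
curr_dist = Categorical(logits=curr_logits)

# Surrogate loss: ratio of current logp to rollout-time logp on the observed actions.
logp_ratio = torch.exp(curr_dist.log_prob(actions) - old_dist.log_prob(actions))
clip_param = 0.3
surrogate = torch.min(
    advantages * logp_ratio,
    advantages * torch.clamp(logp_ratio, 1 - clip_param, 1 + clip_param),
)

# KL loss: divergence between the rollout-time distribution and the current one.
kl = kl_divergence(old_dist, curr_dist)

kl_coeff = 0.2
total_loss = -surrogate.mean() + kl_coeff * kl.mean()
print(surrogate, kl, total_loss)
```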
Hello, sorry for the late response. So the outputs are logits (raw values) during training, and during the sample phase of training, since my action space is discrete, it will pick the categorical distribution, which in turn applies the softmax to these logits, and the action with the highest logit value is picked as the action taken - I understood this part, I think.
The second part about parameterizing the action distribution is completely lost on me; I can only imagine it applies to the case where the action space is not discrete but continuous? Please advise if this is the case.
Coming to what I am trying to do: keeping the action space discrete, I need the probability of picking an action to be proportional to the value of its logit, instead of a softmax over the logits. Does that already exist, or will I have to create a custom action distribution?
You can see that “parametrizing” is just a “fancy” word for the fact that the action distribution is created as follows (the line from Manny’s link):
curr_action_dist = dist_class(logits, model)
Hence, it depends on the logits.
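On the discrete vs. continuous confusion: parameterizing happens for any action space type; only the distribution class that the logits feed into changes. A hedged sketch in plain PyTorch (the mean/log-std split for the continuous case is the usual convention, not RLlib's literal code):

```python
import torch
from torch.distributions import Categorical, Normal

model_out = torch.tensor([[0.4, -1.2, 2.0, 0.1]])  # raw model output for one observation

# Discrete(4) action space: the four outputs are the logits of a categorical distribution.
discrete_dist = Categorical(logits=model_out)

# Box(2,) continuous action space: the same-sized output is split into mean and
# log-std, parameterizing a diagonal Gaussian instead.
mean, log_std = torch.chunk(model_out, 2, dim=-1)
continuous_dist = Normal(mean, log_std.exp())

print(discrete_dist.sample(), continuous_dist.sample())
```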
Coming to what I am trying to do: keeping the action space discrete, I need the probability of picking an action to be proportional to the value of its logit, instead of a softmax over the logits. Does that already exist, or will I have to create a custom action distribution?
If you want an action distribution that is proportional to the logits, that’s easy:
The probability of picking discrete action n will be the logit for action n divided by the sum of all logits.
This is ordinary normalization and fulfills the same purpose as softmax, but it is less numerically stable, so you'd normally not use it. You'll have to code this yourself; RLlib offers examples of distributions that you can copy and modify to your liking - see the sketch below for the idea.
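A minimal sketch of that idea in plain PyTorch (the class name and the clamping of negative logits are my own choices; a real RLlib action distribution would instead subclass the framework-specific distribution wrappers shipped with RLlib and be registered via ModelCatalog.register_custom_action_dist):

```python
import torch
from torch.distributions import Categorical

class ProportionalCategorical:
    """Picks action n with probability logit_n / sum(logits) instead of softmax(logits).

    This only makes sense for non-negative logits, so they are clamped here; the
    original post does not say how negative logits should be handled.
    """

    def __init__(self, logits: torch.Tensor, eps: float = 1e-8):
        positive = torch.clamp(logits, min=0.0) + eps
        self.probs = positive / positive.sum(dim=-1, keepdim=True)
        self._dist = Categorical(probs=self.probs)

    def sample(self) -> torch.Tensor:
        return self._dist.sample()

    def logp(self, actions: torch.Tensor) -> torch.Tensor:
        return self._dist.log_prob(actions)


logits = torch.tensor([[2.0, 1.0, 1.0, 0.0]])
dist = ProportionalCategorical(logits)
print(dist.probs)        # roughly [[0.5, 0.25, 0.25, 0.0]]
action = dist.sample()
print(action, dist.logp(action))
```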