Output of PPO with discrete actions

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hello, does the action distribution class used by PPO for a discrete action space default to class Categorical(TFActionDistribution)? Is its output probabilities, or just the raw logit values? Shouldn’t the policy outputs be stochastic in some way?

I’m confused about my policy’s output. Please advise.

For clarification - is there a softmax in the last layer by default, or does PPO use the raw values? I think that is what the question boils down to.

Hi @hridayns,

There is no softmax in the model computed by the policy. It outputs logits.

During rollouts in the sample phase of training, those logits are passed to an action distribution appropriate for the type of action space, which generates the actions passed to the environment. For a Discrete action space this is a categorical distribution, which applies a softmax.
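
As a rough sketch of what the categorical distribution does with those logits at sampling time (plain NumPy for illustration, not RLlib’s actual code):

import numpy as np

rng = np.random.default_rng(0)

logits = np.array([2.0, 0.5, -1.0])   # raw model output for one observation

# Softmax turns the logits into action probabilities
# (subtracting the max is only for numerical stability).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The action sent to the environment is sampled from these probabilities,
# which is what makes the policy stochastic during exploration.
action = rng.choice(len(logits), p=probs)
print(probs, action)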

During training, the logits parameterize an action distribution that is then used to compute the surrogate_loss and the kl_loss. The first uses the log-probability (logp) of the observed actions under the current model. The second computes the KL divergence between the action distribution parameterized by the model during rollouts and that of the current model.
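
Again only as a sketch of the underlying math (not the actual RLlib loss code), the two loss terms use the logits roughly like this:

import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

behaviour_logits = np.array([1.5, 0.7, -0.8])   # logits recorded at rollout time
current_logits = np.array([2.0, 0.5, -1.0])     # logits from the current model
action_taken = 0                                # action observed in the rollout

p_old = softmax(behaviour_logits)
p_new = softmax(current_logits)

# surrogate_loss uses the log-probability of the observed action under the
# current model; in PPO it enters through the ratio exp(logp_new - logp_old).
ratio = np.exp(np.log(p_new[action_taken]) - np.log(p_old[action_taken]))

# kl_loss uses the KL divergence between the rollout-time distribution and
# the current one.
kl = np.sum(p_old * (np.log(p_old) - np.log(p_new)))
print(ratio, kl)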

Hello, sorry for the late response. So the outputs are logits (raw values), and during the sample phase of training, since my action space is discrete, it picks the Categorical distribution, which in turn applies a softmax to these logits and samples the action to take from the resulting probabilities - I understood this part, I think.

The second part, about parametrizing the action distribution, is completely lost on me; I can only imagine that applying when the action space is not discrete but continuous. Please advise if this is the case?

Coming to what I am trying to do - keeping the action space discrete, I need the probability of picking an action to be proportional to the value of its logit, rather than a softmax over the logits. Does that already exist, or will I have to create a custom action distribution?

Hi @hridayns ,

You can see that “parametrizing” is a “fancy” word for the fact that the action distribution is created as follows (the line from Manny’s link):

curr_action_dist = dist_class(logits, model)

Hence, it depends on the logits.
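
For completeness, a minimal sketch of where dist_class comes from and how it depends on the logits, assuming the Ray 1.x TF API (ModelCatalog.get_action_dist returning the Categorical class for a Discrete space); import paths and signatures may differ between Ray versions:

import gym
import tensorflow as tf
from ray.rllib.models import ModelCatalog

action_space = gym.spaces.Discrete(3)

# For a Discrete space this should return the Categorical distribution class
# plus the number of model outputs it expects (here: 3 logits).
dist_class, dist_dim = ModelCatalog.get_action_dist(action_space, config={}, framework="tf")

logits = tf.constant([[2.0, 0.5, -1.0]])     # pretend model output
curr_action_dist = dist_class(logits, None)  # "parametrized" by the logits

# Different logits give a different distribution, and therefore different
# logp / kl / entropy values in the loss.
print(curr_action_dist.logp(tf.constant([0])))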

Coming to what I am trying to do - keeping the action space discrete, I need the probability of picking an action to be proportional to the value of its logit, rather than a softmax over the logits. Does that already exist, or will I have to create a custom action distribution?

If you want an action distribution that is proportional to the logits, that’s easy:
The probability of the discrete action n is its logit divided by the sum of all logits.
This is ordinary normalization and fulfills the same purpose as softmax, but it is less numerically stable (and only well-defined for non-negative logits), so you would normally not use it. You’ll have to code this yourself; RLlib offers examples of action distributions that you can copy and modify to your liking 🙂
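
A minimal sketch of such a distribution, assuming the Ray 1.x TF action-distribution API (Categorical from ray.rllib.models.tf.tf_action_dist and ModelCatalog.register_custom_action_dist); the name ProportionalCategorical is made up here, and the details may need adapting to your Ray version:

import tensorflow as tf
from ray.rllib.models import ModelCatalog
from ray.rllib.models.tf.tf_action_dist import Categorical


class ProportionalCategorical(Categorical):
    """Picks actions with probability proportional to the raw logits.

    Assumes the model's outputs are strictly positive; negative logits
    would make "proportional to the logits" ill-defined.
    """

    def __init__(self, inputs, model=None):
        # Plain normalization instead of softmax ...
        probs = inputs / tf.reduce_sum(inputs, axis=-1, keepdims=True)
        # ... then hand log-probabilities to the parent Categorical, whose
        # softmax-based sample/logp/kl/entropy reproduce exactly these probs.
        super().__init__(tf.math.log(probs), model)

    @staticmethod
    def required_model_output_shape(action_space, model_config):
        return action_space.n


# Register it and point the model config at it, e.g.:
# ModelCatalog.register_custom_action_dist("proportional_categorical", ProportionalCategorical)
# config["model"]["custom_action_dist"] = "proportional_categorical"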