Output from custom policy network for PPO

I’m trying to implement a custom model and train it with PPO. With a discrete action space, we generally want to apply a softmax to the network output to get the distribution over actions. In ppo_torch_policy.py it says that the logits should be derived from the model.

My question is: does the exploration component apply the softmax to the logits internally, so that we only need to output the raw logits from the policy network?

I’m sorry, it’s a bit tricky to find PPO implementation details in RLlib.
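
For reference, here is a minimal sketch of the kind of model I mean (assuming a flat Box observation space and RLlib's TorchModelV2 API; the class and layer names are just placeholders):

```python
import numpy as np
import torch.nn as nn
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2


class RawLogitsModel(TorchModelV2, nn.Module):
    """Outputs raw (unnormalized) logits for a discrete action space."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        self._hidden = nn.Sequential(
            nn.Linear(int(np.prod(obs_space.shape)), 64),
            nn.ReLU(),
        )
        self._logits = nn.Linear(64, num_outputs)  # one output per discrete action
        self._value = nn.Linear(64, 1)
        self._features = None

    def forward(self, input_dict, state, seq_lens):
        self._features = self._hidden(input_dict["obs"].float())
        # Return raw logits here -- no softmax applied in the model?
        return self._logits(self._features), state

    def value_function(self):
        return self._value(self._features).squeeze(1)
```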

Hi @Jerome-Cong ,

Logits are indeed derived from the model.
We then apply the action distribution, which for a discrete action space will normally be a TorchCategorical, and it accepts raw logits directly. Have a look at torch.distributions.categorical.Categorical - it does this for us! :slight_smile:
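
A quick sketch of what that looks like in plain PyTorch (the tensor values here are just made up for illustration):

```python
import torch
from torch.distributions import Categorical

# Raw logits straight from the policy network (batch of 2, 4 discrete actions).
logits = torch.tensor([[1.5, -0.3, 0.2, 0.0],
                       [0.1, 0.1, 2.0, -1.0]])

dist = Categorical(logits=logits)   # softmax is applied internally
actions = dist.sample()             # sampled action per batch element
log_probs = dist.log_prob(actions)  # log-prob of the sampled actions
print(dist.probs)                   # normalized probabilities, each row sums to 1
```

So your model only needs to return the raw logits; the distribution takes care of the normalization when sampling and computing log-probs.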