I’m trying to implement a custom model and train it with PPO. For a discrete action space, we generally apply a softmax to the network output to get a distribution over actions. In ppo_torch_policy.py it says that the logits should be derived from the model.
My question is: does RLlib apply the softmax to these logits internally later on (e.g. in the action distribution / exploration step), so that the policy network only needs to output the raw logits?
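For reference, here is a minimal sketch of what I mean by "outputting raw logits" — the class structure and forward()/value_function() signatures follow RLlib's TorchModelV2 API, but the layer sizes, layer names, and the registered model name "my_ppo_model" are just placeholders I made up:

```python
import torch.nn as nn
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2


class MyPPOModel(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        hidden = 256  # arbitrary hidden size
        self.body = nn.Sequential(
            nn.Linear(obs_space.shape[0], hidden),
            nn.ReLU(),
        )
        # Policy head: one output per discrete action, returned as raw
        # (unnormalized) logits -- no softmax applied here.
        self.logits_head = nn.Linear(hidden, num_outputs)
        # Value head, needed by PPO for the value function baseline.
        self.value_head = nn.Linear(hidden, 1)
        self._features = None

    def forward(self, input_dict, state, seq_lens):
        self._features = self.body(input_dict["obs"].float())
        logits = self.logits_head(self._features)  # raw logits, no softmax
        return logits, state

    def value_function(self):
        return self.value_head(self._features).squeeze(1)


# Hypothetical registration under an arbitrary name.
ModelCatalog.register_custom_model("my_ppo_model", MyPPOModel)
```

My understanding is that the Categorical action distribution built from these logits handles the normalization when sampling actions and computing log-probs, but I’d like to confirm that.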
I’m sorry, it’s a bit tricky to find PPO implementation details in RLlib.