I’m trying to implement a custom model and train it with PPO. For a discrete action space, we generally apply a softmax to the network output to get a distribution over actions. In ppo_torch_policy.py it says that the logits should be derived from the model.
My question is: does RLlib apply the softmax to these logits internally later on (e.g. in the action distribution / exploration step), so that the policy network only needs to output the raw logits?
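For reference, here is a minimal sketch of what I mean by "outputting raw logits" — the class structure and forward()/value_function() signatures follow RLlib's TorchModelV2 API, but the layer sizes, layer names, and the registered model name "my_ppo_model" are just placeholders I made up:

```python
import torch.nn as nn
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2


class MyPPOModel(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        hidden = 256  # arbitrary hidden size
        self.body = nn.Sequential(
            nn.Linear(obs_space.shape[0], hidden),
            nn.ReLU(),
        )
        # Policy head: one output per discrete action, returned as raw
        # (unnormalized) logits -- no softmax applied here.
        self.logits_head = nn.Linear(hidden, num_outputs)
        # Value head, needed by PPO for the value function baseline.
        self.value_head = nn.Linear(hidden, 1)
        self._features = None

    def forward(self, input_dict, state, seq_lens):
        self._features = self.body(input_dict["obs"].float())
        logits = self.logits_head(self._features)  # raw logits, no softmax
        return logits, state

    def value_function(self):
        return self.value_head(self._features).squeeze(1)


# Hypothetical registration under an arbitrary name.
ModelCatalog.register_custom_model("my_ppo_model", MyPPOModel)
```

My understanding is that the Categorical action distribution built from these logits handles the normalization when sampling actions and computing log-probs, but I’d like to confirm that.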
I’m sorry, it’s a bit tricky to find PPO implementation details in RLlib.