I want to create a neural network with the same architecture as the policy network; the only difference is that I want to replace the softmax on the last layer of the network with tanh.
Any help will be greatly appreciated!
Hi @dev1dze,
Which algorithm are you using?
Hi Mannyv, and thanks for your reply!
I want to use PPO.
There is no softmax in the network used for the actor in PPO. The network outputs the activations of its last layer unmodified, as logits; no activation function is applied there.
The softmax is applied inside the Categorical action distribution used for Discrete action spaces. You could create a custom action distribution to use instead, but you would also have to define entropy, kl, log_prob, etc. for that distribution. RLlib does have a SquashedGaussian distribution you can use with Box action spaces, but not with Discrete ones.
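You can see this in plain PyTorch (outside RLlib): `torch.distributions.Categorical` accepts raw logits and normalizes them itself, so the network never needs to apply a softmax. A small illustration:

```python
import torch

# The actor network outputs raw logits; no activation on the last layer.
logits = torch.tensor([2.0, 0.5, -1.0])

# The softmax only happens inside the Categorical distribution.
dist = torch.distributions.Categorical(logits=logits)

print(dist.probs)                      # identical to torch.softmax(logits, dim=-1)
print(torch.softmax(logits, dim=-1))

action = dist.sample()                 # sampling and log_prob also work on logits
print(dist.log_prob(action))
```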
Thanks for your answer,
Let me try to explain it better: I am not trying to define a new policy network and use it to train the agent in place of the one the algorithm currently uses. I just want to use this network to learn a reward/cost function for the states and actions sampled in a trajectory/episode.
That is why I would like a reward network with the same architecture as the policy network (the only difference being no softmax on the output layer) and with a custom loss function, updated separately after the policy update, roughly like the sketch below.
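Something like this is what I have in mind (a minimal plain-PyTorch sketch; the hidden sizes just mirror a default two-layer MLP policy, and the loss and names are placeholders, not RLlib API):

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Same MLP shape as the policy network, but tanh on the output instead of softmax."""

    def __init__(self, obs_dim, act_dim, hiddens=(256, 256)):
        super().__init__()
        layers, in_size = [], obs_dim + act_dim
        for h in hiddens:
            layers += [nn.Linear(in_size, h), nn.Tanh()]
            in_size = h
        layers += [nn.Linear(in_size, 1), nn.Tanh()]  # tanh output layer
        self.net = nn.Sequential(*layers)

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

# Updated separately, after the policy update, with a custom loss.
reward_net = RewardNet(obs_dim=4, act_dim=2)
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-4)

def my_reward_loss(pred, target):
    # Placeholder for the actual custom loss.
    return ((pred - target) ** 2).mean()

obs = torch.randn(32, 4)     # sampled states from the batch
act = torch.randn(32, 2)     # sampled actions (e.g. one-hot encoded for Discrete)
target = torch.randn(32)     # whatever target the custom loss needs

loss = my_reward_loss(reward_net(obs, act), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```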
If you can point me to example code, or to the file where I should be looking, that would be a great help.
Thanks a lot!