I wonder if anyone knows how to compute multiple actions out of the trained policy?
compute_single_action seems to always return the greedy action. So, how can we get non-greedy actions, like the second-best action?
Policies fall into two categories here: deterministic and stochastic. For deterministic policies this is straightforward: pass explore=True as an argument to compute_single_action, and also set it in your config.
The policy will then choose exploratory actions from time to time; how often, and which ones, depends on the exploration strategy (there are a couple of them).
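To make "exploratory actions from time to time" concrete, here is a minimal sketch of one common strategy, epsilon-greedy, written in plain NumPy. It is not RLlib's implementation; the function name epsilon_greedy_action and the example Q-values are hypothetical and only illustrate the idea for a discrete action space.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """Hypothetical helper: random action with probability epsilon, else greedy."""
    q_values = np.asarray(q_values, dtype=float)
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return int(rng.integers(len(q_values)))
    # Exploit: pick the action with the highest estimated value.
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = [0.1, 0.7, 0.2]  # made-up Q-values for three discrete actions

# epsilon=0.0 never explores, so this is the greedy action.
greedy = epsilon_greedy_action(q, epsilon=0.0, rng=rng)

# With epsilon=0.3 most picks are still greedy, but not all of them.
explored = [epsilon_greedy_action(q, epsilon=0.3, rng=rng) for _ in range(1000)]
```

With exploration switched off (epsilon of 0 here) you always get the same action back, which matches what you are seeing.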
For stochastic policies, you can switch off exploration, but the policy may have learned an optimal policy that is itself stochastic, so it is not meant to always choose the argmax of its action distribution.
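If what you want is the second-best action, one option is to rank the actions yourself. The sketch below assumes a discrete action space and that you can get a vector of per-action logits (or Q-values) out of your policy; how to obtain that vector depends on your RLlib version and is not shown here. The helper names rank_actions and softmax and the logits are made up for illustration.

```python
import numpy as np

def rank_actions(logits):
    """Hypothetical helper: action indices sorted from most to least preferred."""
    return np.argsort(np.asarray(logits, dtype=float))[::-1]

def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()  # subtract the max to avoid overflow in exp
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 0.5, 1.0]  # made-up logits for three discrete actions

# Deterministic view: argmax first, then the runner-up.
ranking = rank_actions(logits)
best, second_best = int(ranking[0]), int(ranking[1])

# Stochastic view: sample from the distribution instead of taking the argmax.
probs = softmax(logits)
rng = np.random.default_rng(0)
samples = rng.choice(len(probs), size=1000, p=probs)
```

The ranking gives you the greedy action and the runner-up, while sampling reproduces what a stochastic policy is actually meant to do: pick each action in proportion to its probability.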