Hi all,
I wonder if anyone knows how to compute multiple actions from a trained policy? compute_single_action seems to always return the greedy action.
So how can we get non-greedy actions, such as the second-best action?
Thanks!
Hi @deepgravity,
Policies fall into two categories here: deterministic and stochastic.
For deterministic policies this is straightforward: pass explore=True as an argument to compute_single_action and also set it in your config.
The policy will then choose exploratory actions from time to time. How often, and which ones, depends on the exploration strategy (there are a couple of them).
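To illustrate what "exploratory actions from time to time" means, here is a minimal sketch of epsilon-greedy, one common exploration strategy. It is plain NumPy and not RLlib's own implementation; the function name epsilon_greedy_action and the example values are made up for illustration.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the argmax."""
    q_values = np.asarray(q_values, dtype=float)
    if rng.random() < epsilon:
        # Exploratory step: any action, including the greedy one, may be drawn.
        return int(rng.integers(len(q_values)))
    # Greedy step: the action with the highest value.
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = [0.1, 0.7, 0.2]  # hypothetical per-action values
greedy_action = epsilon_greedy_action(q, epsilon=0.0, rng=rng)
explored = [epsilon_greedy_action(q, epsilon=1.0, rng=rng) for _ in range(20)]
```

With epsilon=0.0 this always returns the greedy action, which matches the behaviour described in the question. Any larger epsilon mixes in random actions.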
For stochastic policies, you can switch off exploration, but the policy may still have learned an optimal stochastic policy, so it is not meant to always choose the action at the argmax of its action distribution.
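For the "second-best action" part of the question, a rough sketch for a discrete action space: if you can get the per-action logits (or Q-values) for an observation, you can rank the actions yourself or sample from the distribution. How you obtain those logits depends on your RLlib version and algorithm, so the code below assumes you already have them as an array. The helpers rank_actions, softmax and sample_action are hypothetical names, not RLlib functions.

```python
import numpy as np

def rank_actions(logits):
    """Return action indices sorted from best to worst by logit."""
    logits = np.asarray(logits, dtype=float)
    return np.argsort(-logits, kind="stable")

def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    logits = np.asarray(logits, dtype=float)
    z = np.exp(logits - logits.max())
    return z / z.sum()

def sample_action(logits, rng):
    """Draw one action from the categorical distribution given by the logits."""
    probs = softmax(logits)
    return int(rng.choice(len(probs), p=probs))

logits = [1.0, 3.0, 2.0]  # hypothetical logits for one observation
ranking = rank_actions(logits)
best, second_best = int(ranking[0]), int(ranking[1])
sampled = sample_action(logits, np.random.default_rng(0))
```

Here ranking[1] is the second-best action, and sample_action returns non-greedy actions in proportion to their probability, which is how a stochastic policy is meant to be used.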