Compute non-greedy actions out of the trained policy

deepgravity · June 3, 2022, 11:20pm

Hi all,
I wonder if anyone knows how to compute multiple actions out of the trained policy? compute_single_action seems to always return the greedy action. So, how can we get non-greedy actions, like the second-best action?
Thanks!

arturn · June 9, 2022, 11:10am

Hi @deepgravity,

Policies fall into two categories here: deterministic and stochastic. For deterministic policies this is straightforward, since you can simply pass explore=True as an argument to compute_single_action and also set it in your config.
This will make the policy choose exploratory actions from time to time. How they are chosen depends on the exploration strategy (there are a couple of them).
For stochastic policies, you can switch off exploration, but they might still have learned an optimal stochastic policy, and are therefore not meant to always choose the action at the argmax of the action distribution.
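
If you specifically want the second-best action for a discrete action space, one option is to rank the actions yourself from the policy's action-distribution inputs (logits or Q-values). The sketch below is plain NumPy and does not call RLlib: the `logits` array is a made-up example, and `rank_actions` and `sample_action` are hypothetical helper names. How you get the logits out of a trained policy depends on your RLlib version, so that step is left out here.

```python
import numpy as np


def rank_actions(logits):
    """Return action indices sorted from most to least preferred."""
    logits = np.asarray(logits, dtype=np.float64)
    # Negate so that argsort (ascending) yields the highest logit first.
    return np.argsort(-logits, kind="stable")


def sample_action(logits, rng, temperature=1.0):
    """Sample an action index from softmax(logits / temperature)."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()  # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))


# Made-up logits for a 4-action discrete space (assumption, not RLlib output).
logits = [0.1, 2.0, 1.5, -0.3]

ranking = rank_actions(logits)
greedy = int(ranking[0])
second_best = int(ranking[1])

rng = np.random.default_rng(0)
sampled = sample_action(logits, rng, temperature=1.0)
```

Here `greedy` is the argmax action and `second_best` is the next one in the ranking. `sample_action` is the stochastic alternative: a higher `temperature` flattens the distribution, so non-greedy actions come up more often.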
