Hi
I am tuning DQN and using the best trial's checkpoint to simulate episodes from the learned policy (code below). I get a different episode in each run with the same policy, probably because the learned policy is stochastic. Could anyone please help me find a way to make the action selection deterministic by always choosing the best action at each step?
from ray.rllib.agents import dqn

agent = dqn.DQNTrainer(
    config=analysis.best_config,
    env=select_env,
)
agent.restore(checkpoint_path)
done = False
obs = env.reset()
step = 0
while not done:
    action = agent.compute_action(obs)
    obs, reward, done, info = env.step(action)
    step += 1
Bam4d
May 21, 2021, 12:43pm
I think you can do this, which forces deterministic action selection.
compute_actions(obs, explore=False)
You mean
agent.compute_action(obs, explore=False)?
Bam4d
May 21, 2021, 12:48pm
I think so.
agent.compute_action(obs, explore=False)
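For reference, a minimal sketch of a rollout loop using the per-call override (assuming agent and env are already set up as in the original snippet):

done = False
obs = env.reset()
while not done:
    # explore=False overrides the exploration config for this single call,
    # so DQN takes the greedy (argmax over Q-values) action instead of
    # sampling epsilon-greedily
    action = agent.compute_action(obs, explore=False)
    obs, reward, done, info = env.step(action)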
There is a config key called “explore”. If you set it to False it will turn off any randomness in the actions.
config = analysis.best_config
config["explore"] = False

agent = dqn.DQNTrainer(
    config=config,
    env=select_env,
)
agent.restore(checkpoint_path)
done = False
obs = env.reset()
step = 0
while not done:
    action = agent.compute_action(obs)
    obs, reward, done, info = env.step(action)
    step += 1
@Saurabh_Arora Just a sanity check, you didn’t say what select_env is. Maybe you have randomness there as well.
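If select_env itself is stochastic (random initial states, sampled transitions), episodes will still differ even with a fully deterministic policy. A minimal sketch of pinning the environment down, assuming select_env is a registered Gym env id and the env follows the old Gym API with a seed() method:

import gym

env = gym.make(select_env)
env.seed(0)        # fix the env's RNG so resets and transitions repeat
obs = env.reset()

There is also a top-level "seed" key in RLlib's common config (e.g. config["seed"] = 0) that you can set for reproducibility of the workers.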
@sven1977, the trainer class has the following method signature:
@PublicAPI
def compute_action(self,
                   observation: TensorStructType,
                   state: List[TensorStructType] = None,
                   prev_action: TensorStructType = None,
                   prev_reward: float = None,
                   info: EnvInfoDict = None,
                   policy_id: PolicyID = DEFAULT_POLICY_ID,
                   full_fetch: bool = False,
                   explore: bool = None) -> TensorStructType:
so I passed explore=False to make the policy deterministic. Please correct me if it does not achieve the same effect.
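A quick way to verify: run two rollouts with explore=False from the same seeded start and compare the action sequences. A minimal sketch, assuming a discrete action space and the old Gym API:

def rollout(agent, env, seed=0):
    env.seed(seed)           # same seed -> same initial state and transitions
    obs = env.reset()
    done, actions = False, []
    while not done:
        # greedy action selection, no exploration noise
        action = agent.compute_action(obs, explore=False)
        actions.append(action)
        obs, reward, done, info = env.step(action)
    return actions

# With a deterministic policy and a seeded env, both runs should be identical.
assert rollout(agent, env) == rollout(agent, env)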