Getting deterministic policy after DQN training

Hi,
I am tuning DQN and using the best trial's checkpoint to simulate episodes from the learned policy (code below). I get a different episode on each run with the same policy, probably because the learned policy is stochastic. Could anyone help me find a way to make the action selection deterministic by always choosing the best action at each step?

import ray.rllib.agents.dqn as dqn

agent = dqn.DQNTrainer(
    config=analysis.best_config,
    env=select_env,
)
agent.restore(checkpoint_path)

# env: an instance of select_env, created beforehand
done = False
obs = env.reset()
step = 0
while not done:
    action = agent.compute_action(obs)
    obs, reward, done, info = env.step(action)
    step += 1

I think you can do this, which forces deterministic action selection.

compute_actions(obs, explore=False)

You mean

agent.compute_action(obs, explore=False)?

I think so.

agent.compute_action(obs, explore=False)
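
For example, adapting the rollout loop from the original post (same agent and env setup assumed):

done = False
obs = env.reset()
while not done:
    # explore=False requests the greedy action instead of a sampled one
    action = agent.compute_action(obs, explore=False)
    obs, reward, done, info = env.step(action)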

There is a config key called "explore". If you set it to False, it will turn off any randomness in the action selection.

config = analysis.best_config
config["explore"] = False  # disable exploration globally for this Trainer

agent = dqn.DQNTrainer(
    config=config,
    env=select_env,
)
agent.restore(checkpoint_path)

done = False
obs = env.reset()
step = 0
while not done:
    action = agent.compute_action(obs)
    obs, reward, done, info = env.step(action)
    step += 1

@Saurabh_Arora Just a sanity check: you didn't say what select_env is. Maybe you have randomness there as well.
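
If the environment itself is stochastic (random initial state, noisy transitions), explore=False alone won't make episodes identical. A minimal sketch of seeding both sides; the seed value 42 is arbitrary, and env.seed assumes the older Gym API:

config = analysis.best_config
config["explore"] = False
config["seed"] = 42  # seeds RLlib's workers and policies

env.seed(42)  # seeds the evaluation env's own RNG (old Gym API)
obs = env.reset()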

@sven1977, the Trainer class has the following signature:

@PublicAPI
def compute_action(self,
                   observation: TensorStructType,
                   state: List[TensorStructType] = None,
                   prev_action: TensorStructType = None,
                   prev_reward: float = None,
                   info: EnvInfoDict = None,
                   policy_id: PolicyID = DEFAULT_POLICY_ID,
                   full_fetch: bool = False,
                   explore: bool = None) -> TensorStructType:

so I set the explore flag to False to make the policy deterministic. Please correct me if it does not achieve the same effect.
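
From the signature, explore defaults to None, which presumably falls back to the config's "explore" setting, so passing explore=False explicitly should force the greedy action for that call regardless of the config:

# greedy action for this single call, regardless of config["explore"]
action = agent.compute_action(obs, explore=False)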