Hi
I am tuning DQN and using the best trial's checkpoint to simulate episodes from the learned policy (code below). I get a different episode in each run with the same policy, probably because the learned policy is stochastic. Could anyone please help me find a way to make the action selection deterministic by always choosing the best action at each step?
from ray.rllib.agents import dqn

agent = dqn.DQNTrainer(
    config=analysis.best_config,
    env=select_env,
)
agent.restore(checkpoint_path)
done = False
obs = env.reset()
step = 0
while not done:
    action = agent.compute_action(obs)
    obs, reward, done, info = env.step(action)
    step += 1
Bam4d
May 21, 2021, 12:43pm
I think you can do this, which forces deterministic action selection.
compute_actions(obs, explore=False)
You mean
agent.compute_action(obs, explore=False)?
Bam4d
May 21, 2021, 12:48pm
I think so.
agent.compute_action(obs, explore=False)
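For reference, a minimal sketch of a rollout loop using the per-call override (assuming agent and env are already set up as in the original snippet):

done = False
obs = env.reset()
while not done:
    # explore=False overrides the exploration config for this single call,
    # so DQN takes the greedy (argmax over Q-values) action instead of
    # sampling epsilon-greedily
    action = agent.compute_action(obs, explore=False)
    obs, reward, done, info = env.step(action)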
There is a config key called “explore”. If you set it to False it will turn off any randomness in the actions.
config = analysis.best_config
config["explore"] = False

agent = dqn.DQNTrainer(
    config=config,
    env=select_env,
)
agent.restore(checkpoint_path)
done = False
obs = env.reset()
step = 0
while not done:
    action = agent.compute_action(obs)
    obs, reward, done, info = env.step(action)
    step += 1
@Saurabh_Arora Just a sanity check, you didn’t say what select_env is. Maybe you have randomness there as well.
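If select_env itself is stochastic (random initial states, sampled transitions), episodes will still differ even with a fully deterministic policy. A minimal sketch of pinning the environment down, assuming select_env is a registered Gym env id and the env follows the old Gym API with a seed() method:

import gym

env = gym.make(select_env)
env.seed(0)        # fix the env's RNG so resets and transitions repeat
obs = env.reset()

There is also a top-level "seed" key in RLlib's common config (e.g. config["seed"] = 0) that you can set for reproducibility of the workers.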
@sven1977, the trainer class has the following method signature:
@PublicAPI
def compute_action(self,
                   observation: TensorStructType,
                   state: List[TensorStructType] = None,
                   prev_action: TensorStructType = None,
                   prev_reward: float = None,
                   info: EnvInfoDict = None,
                   policy_id: PolicyID = DEFAULT_POLICY_ID,
                   full_fetch: bool = False,
                   explore: bool = None) -> TensorStructType:
so I passed explore=False to make the policy deterministic. Please correct me if it does not achieve the same effect.
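A quick way to verify: run two rollouts with explore=False from the same seeded start and compare the action sequences. A minimal sketch, assuming a discrete action space and the old Gym API:

def rollout(agent, env, seed=0):
    env.seed(seed)           # same seed -> same initial state and transitions
    obs = env.reset()
    done, actions = False, []
    while not done:
        # greedy action selection, no exploration noise
        action = agent.compute_action(obs, explore=False)
        actions.append(action)
        obs, reward, done, info = env.step(action)
    return actions

# With a deterministic policy and a seeded env, both runs should be identical.
assert rollout(agent, env) == rollout(agent, env)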