Action Masking Model: Deterministic selection of the best action

How severely does this issue affect your experience of using Ray?

  • Medium: It causes significant difficulty in completing my task, but I can work around it.

Dear community,

I noticed that executing gym environment episodes with my trained policy via the policy checkpoint API (Policy.from_checkpoint() followed by compute_single_action) leads to non-deterministic results. Specifically, every time I run a code block like the following …

episode_reward = 0
terminated = truncated = False
obs, _ = env.reset()
action_sequence = []

while not terminated and not truncated:
    # compute_single_action() returns a tuple (action, state_out, extra_fetches),
    # so action[0] is the actual action.
    action = my_restored_policy.compute_single_action(obs)
    obs, reward, terminated, truncated, info = env.step(action[0])
    episode_reward += reward
    action_sequence.append(action[0])

print(f"\rTotal episode reward: {episode_reward}")
print(f"\rAction sequence: {action_sequence}")

… I may get a different final reward.

One solution idea is deterministic action selection during evaluation: instead of sampling from the action distribution, I want to select the action with the highest probability from the masked action set.
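If I understand the exploration settings correctly, passing explore=False to compute_single_action should already give the greedy (argmax) action instead of a sample. Here is a minimal sketch of what I have in mind, reusing my_restored_policy and env from the snippet above; please correct me if this is not the intended way:

obs, _ = env.reset()
terminated = truncated = False
episode_reward = 0

while not terminated and not truncated:
    # explore=False: ask the policy for its deterministic "best" action
    # instead of sampling from the action distribution.
    action, _, _ = my_restored_policy.compute_single_action(obs, explore=False)
    obs, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward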

Where should I integrate that? I have already read through How To Customize Policies — Ray 3.0.0.dev0, but I am a bit unsure how to apply it in PyTorch. Furthermore, I have the feeling that changes to a custom policy might also interfere with the PPOConfig.evaluation() settings.
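For reference, my evaluation setup currently looks roughly like the following; the env name and evaluation_duration are just placeholders from my project, and I assume explore=False in evaluation_config is the relevant knob, but I am not sure how it interacts with a custom policy:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="MyMaskedEnv-v0")  # placeholder for my custom masked env
    .evaluation(
        evaluation_interval=1,
        evaluation_duration=5,
        # My assumption: this should force deterministic (greedy) actions
        # during the evaluation phase.
        evaluation_config={"explore": False},
    )
)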

Hence, my questions to the community:

  1. How can a custom policy and/or a custom evaluation function support the solution idea described above?
  2. How can I incorporate this into the old RLlib API stack? For multiple reasons I am still on Ray 2.10 and TorchModelV2 (see the sketch after this list for the kind of model I am using).
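For context, my action-masking model follows the pattern from RLlib's action-masking example on the old API stack; the class and layer setup below is only a sketch of my model, not the exact code:

import torch

from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.torch_utils import FLOAT_MIN


class ActionMaskModel(TorchModelV2, torch.nn.Module):
    """Sketch: obs is a Dict space with "action_mask" and "observations" keys."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name, **kwargs):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name, **kwargs)
        torch.nn.Module.__init__(self)
        orig_space = getattr(obs_space, "original_space", obs_space)
        self.internal_model = TorchFC(
            orig_space["observations"], action_space, num_outputs, model_config, name + "_internal"
        )

    def forward(self, input_dict, state, seq_lens):
        action_mask = input_dict["obs"]["action_mask"]
        logits, _ = self.internal_model({"obs": input_dict["obs"]["observations"]})
        # Push logits of invalid actions towards -inf, so that both sampling
        # and a greedy argmax can only ever select valid actions.
        inf_mask = torch.clamp(torch.log(action_mask), min=FLOAT_MIN)
        return logits + inf_mask, state

    def value_function(self):
        return self.internal_model.value_function()

Given such a model, is setting explore=False already enough to get the highest-probability action from the masked set, or do I need to override the action sampling in a custom policy?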