Action Masking Model: Deterministic selection of the best action

How severely does this issue affect your experience of using Ray?

  • Medium: It causes significant difficulty in completing my task, but I can work around it.

Dear community,

I noticed that executing gym environment episodes with my trained policy via the policy checkpoint API (Policy.from_checkpoint() followed by compute_single_action) leads to non-deterministic results. Specifically, every time I run a code block like the following …

episode_reward = 0
terminated = truncated = False
obs, _ = env.reset()
action_sequence = []

while not terminated and not truncated:
    # compute_single_action() returns a tuple (action, state_out, extra_fetches),
    # so action[0] is the actual action.
    action = my_restored_policy.compute_single_action(obs)
    obs, reward, terminated, truncated, info = env.step(action[0])
    episode_reward += reward
    action_sequence.append(action[0])

print(f"\rTotal episode reward: {episode_reward}")
print(f"\rAction sequence: {action_sequence}")

… I may get a different final reward.

One solution idea is deterministic action selection during evaluation: instead of sampling from the action distribution, I want to select the action with the highest probability from the masked action set.
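If I understand the exploration settings correctly, passing explore=False to compute_single_action should already give the greedy (argmax) action instead of a sample. Here is a minimal sketch of what I have in mind, reusing my_restored_policy and env from the snippet above; please correct me if this is not the intended way:

obs, _ = env.reset()
terminated = truncated = False
episode_reward = 0

while not terminated and not truncated:
    # explore=False: ask the policy for its deterministic "best" action
    # instead of sampling from the action distribution.
    action, _, _ = my_restored_policy.compute_single_action(obs, explore=False)
    obs, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward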

Where should I integrate that? I have already read through How To Customize Policies — Ray 3.0.0.dev0, but I am a bit unsure how to apply it in PyTorch. Furthermore, I have the feeling that changes to a custom policy might also interfere with the PPOConfig.evaluation() settings.
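For reference, my evaluation setup currently looks roughly like the following; the env name and evaluation_duration are just placeholders from my project, and I assume explore=False in evaluation_config is the relevant knob, but I am not sure how it interacts with a custom policy:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="MyMaskedEnv-v0")  # placeholder for my custom masked env
    .evaluation(
        evaluation_interval=1,
        evaluation_duration=5,
        # My assumption: this should force deterministic (greedy) actions
        # during the evaluation phase.
        evaluation_config={"explore": False},
    )
)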

Hence, my questions to the community:

  1. How can a custom policy and/or a custom evaluation function support the solution idea described above?
  2. How can I incorporate this into the old RLlib API stack? For multiple reasons I am still on Ray 2.10 and TorchModelV2 (see the sketch after this list for the kind of model I am using).
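For context, my action-masking model follows the pattern from RLlib's action-masking example on the old API stack; the class and layer setup below is only a sketch of my model, not the exact code:

import torch

from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.torch_utils import FLOAT_MIN


class ActionMaskModel(TorchModelV2, torch.nn.Module):
    """Sketch: obs is a Dict space with "action_mask" and "observations" keys."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name, **kwargs):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name, **kwargs)
        torch.nn.Module.__init__(self)
        orig_space = getattr(obs_space, "original_space", obs_space)
        self.internal_model = TorchFC(
            orig_space["observations"], action_space, num_outputs, model_config, name + "_internal"
        )

    def forward(self, input_dict, state, seq_lens):
        action_mask = input_dict["obs"]["action_mask"]
        logits, _ = self.internal_model({"obs": input_dict["obs"]["observations"]})
        # Push logits of invalid actions towards -inf, so that both sampling
        # and a greedy argmax can only ever select valid actions.
        inf_mask = torch.clamp(torch.log(action_mask), min=FLOAT_MIN)
        return logits + inf_mask, state

    def value_function(self):
        return self.internal_model.value_function()

Given such a model, is setting explore=False already enough to get the highest-probability action from the masked set, or do I need to override the action sampling in a custom policy?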