How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty in completing my task, but I can work around it.
Dear community,
I came across the issue that executing Gym environment episodes with my trained policy via the policy checkpoint API (Policy.from_checkpoint() together with compute_single_action())
leads to non-deterministic results. Specifically, every time I run a code block like the following …
```python
# env and my_restored_policy (loaded via Policy.from_checkpoint(...)) already exist.
episode_reward = 0
terminated = truncated = False
obs, _ = env.reset()
action_sequence = []
while not terminated and not truncated:
    # compute_single_action() returns (action, state_outs, extra_fetches).
    action = my_restored_policy.compute_single_action(obs)
    obs, reward, terminated, truncated, info = env.step(action[0])
    episode_reward += reward
    action_sequence.append(action[0])
print(f"\rTotal episode reward: {episode_reward}")
print(f"\rAction sequence: {action_sequence}")
```
… I may get a different final reward (and action sequence) each time.
One solution idea is deterministic action selection in the evaluation phase: instead of sampling from the action distribution, I want to select the action with the highest probability from the masked action set.
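For illustration, what I have in mind for the evaluation loop is roughly the following (assuming the `explore` keyword of `compute_single_action()` switches off the stochastic sampling, which is how I read the docs):

```python
# Sketch only: same loop as above, but asking the policy for its
# greedy action instead of a sampled one.
obs, _ = env.reset()
terminated = truncated = False
episode_reward = 0
while not terminated and not truncated:
    # explore=False should make the returned action deterministic.
    action, _, _ = my_restored_policy.compute_single_action(obs, explore=False)
    obs, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward
print(f"Total episode reward: {episode_reward}")
```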
Where should I integrate that? I have already read through How To Customize Policies — Ray 3.0.0.dev0, but I am a bit unsure how to apply it in PyTorch. Furthermore, I have the feeling that such changes to a custom policy would also interfere with the PPOConfig.evaluation() settings.
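For reference, my evaluation settings look roughly like this; the `evaluation_config={"explore": False}` override is my assumption of how the two pieces are supposed to interact:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("my_env")  # placeholder for my actual (masked) env
    .evaluation(
        evaluation_interval=1,
        evaluation_duration=10,
        evaluation_duration_unit="episodes",
        # Assumption: overriding explore only for the evaluation workers
        # should already give deterministic (greedy) evaluation rollouts.
        evaluation_config={"explore": False},
    )
)
```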
Hence, my questions to the community:
- How can a custom policy and/or a custom evaluation function support the above-mentioned solution idea?
- How can I incorporate that into the old RLlib API stack? For multiple reasons, I am still using Ray 2.10 and TorchModelV2 (a sketch of my current model setup follows after this list).
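For completeness, here is a simplified sketch of the kind of TorchModelV2 action-masking model I am using (class and dict key names are illustrative, following the standard action-mask pattern from the RLlib examples). As far as I understand, with exploration disabled the policy takes TorchCategorical's deterministic sample, i.e. the argmax over these masked logits, which would be exactly the "highest probability from the masked action set" behavior I am after:

```python
import torch
from torch import nn
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.torch_utils import FLOAT_MIN


class ActionMaskModel(TorchModelV2, nn.Module):
    """TorchModelV2 that masks invalid actions by pushing their logits to -inf.

    Assumes a Dict observation space with keys "observations" (features)
    and "action_mask" (0/1 vector over the discrete actions).
    """

    def __init__(self, obs_space, action_space, num_outputs, model_config, name, **kwargs):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)
        orig_space = getattr(obs_space, "original_space", obs_space)
        self.internal_model = TorchFC(
            orig_space["observations"],
            action_space,
            num_outputs,
            model_config,
            name + "_internal",
        )

    def forward(self, input_dict, state, seq_lens):
        action_mask = input_dict["obs"]["action_mask"]
        logits, _ = self.internal_model({"obs": input_dict["obs"]["observations"]})
        # Invalid actions get (near) -inf logits, so both sampling and the
        # argmax taken under explore=False ignore them.
        inf_mask = torch.clamp(torch.log(action_mask), min=FLOAT_MIN)
        return logits + inf_mask, state

    def value_function(self):
        return self.internal_model.value_function()


ModelCatalog.register_custom_model("action_mask_model", ActionMaskModel)
```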