How severely does this issue affect your experience of using Ray?
- Medium: It causes significant difficulty in completing my task, but I can work around it.
I’m working on implementing 2-player self-play with PPO as in this example. Initially, the agent being trained plays against RandomPolicy until it wins enough times. It looks like RandomPolicy samples directly from the action space of my custom environment. For reference, here is the return statement of RLlib’s RandomPolicy.compute_actions():
```python
return (
    [self.action_space_for_sampling.sample() for _ in range(obs_batch_size)],
    [],
    {},
)
```
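For context, a Box space’s sample() already returns values inside the environment’s own bounds. A tiny illustration, using a hypothetical 1-D Box as a stand-in for my action space:

```python
import gymnasium as gym  # older Ray versions use `gym` instead

# Hypothetical 1-D stand-in for my environment's [0, 1] action space.
space = gym.spaces.Box(0.0, 1.0, shape=(1,))
print(space.sample())  # e.g. [0.37] -- already a legal env action, no unsquashing needed
```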
If I use the config from the self-play example, these values are unsquashed before being sent to the environment, because normalize_actions defaults to True:
```python
config = (
    PPOConfig()
    .environment(
        env="open_spiel_env",
        # normalize_actions=True by default
    )
    ...
)
```
This means the random sample, which is already in the environment’s action space, gets unsquashed (and possibly clipped) unnecessarily. My action space is [0, 1], so after unsquashing, the only possible actions supplied by the random agent lie in [0.5, 1], which is far from ideal.
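To make the math concrete, here is a minimal sketch of the unsquashing step, hand-rolled under the assumption of a 1-D Box(0.0, 1.0) action space (the helper below mimics what RLlib does when normalize_actions=True, rather than calling into RLlib itself):

```python
import numpy as np
import gymnasium as gym

space = gym.spaces.Box(0.0, 1.0, shape=(1,), dtype=np.float32)

def unsquash(value, space):
    # Treat `value` as if it were normalized to [-1, 1], map it onto
    # [low, high], then clip -- a hand-rolled stand-in for RLlib's unsquashing.
    low, high = space.low, space.high
    return np.clip(low + (value + 1.0) * 0.5 * (high - low), low, high)

# The random policy's sample is already in [0, 1] ...
sample = space.sample()
# ... but unsquashing treats it as normalized, so the result can only fall in [0.5, 1].
print(sample, "->", unsquash(sample, space))

# The learned PPO policy outputs roughly [-1, 1]; unsquashing maps that onto the full [0, 1].
print(unsquash(np.array([-1.0], dtype=np.float32), space),
      unsquash(np.array([1.0], dtype=np.float32), space))
```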
If I set normalize_actions to False, that solves this problem but breaks the other agent: the non-random agent outputs actions in [-1, 1], and without the unsquashing step they often get flagged by my environment as illegal. Ideally they would be unsquashed to [0, 1].
What is the best way to solve this? A couple of options would be writing my own random policy or changing my environment to unsquash actions itself, but neither of these seems clean. Am I missing anything? A rough sketch of the first option is shown below.
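For concreteness, here is a rough, untested sketch of the custom-random-policy option: a RandomPolicy subclass (the name NormalizedRandomPolicy is mine) that maps its sample back into [-1, 1], so that RLlib’s unsquashing step returns it to the environment’s [0, 1] range. The import path is the one the self-play example uses, but it may differ across Ray versions:

```python
import numpy as np

# Same import the self-play example uses; the path may vary between Ray versions.
from ray.rllib.examples.policy.random_policy import RandomPolicy


class NormalizedRandomPolicy(RandomPolicy):
    """Random policy that emits actions in the normalized [-1, 1] range.

    With normalize_actions=True, RLlib unsquashes every action from [-1, 1]
    onto the env's [low, high] before stepping the env. Sampling in env space
    and mapping the sample back to [-1, 1] means the later unsquashing lands
    it exactly where it started.
    """

    def compute_actions(self, obs_batch, *args, **kwargs):
        space = self.action_space_for_sampling
        low, high = space.low, space.high  # assumes a bounded Box space
        actions = []
        for _ in range(len(obs_batch)):
            sample = space.sample()  # already in [low, high]
            # Inverse of the unsquashing step: [low, high] -> [-1, 1].
            actions.append(2.0 * (sample - low) / (high - low) - 1.0)
        return actions, [], {}
```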