Training a custom agent against a random policy

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

I’m working on implementing 2-player self-play with PPO as in this example. Initially, the agent that is being trained plays against RandomPolicy until it wins enough times. It looks like the RandomPolicy takes a sample from the action space of my custom environment. For reference, here is RLLib’s RandomPolicy compute_actions() function:

return (
    [self.action_space_for_sampling.sample() for _ in range(obs_batch_size)],
    [],
    {},
)

If I use the config from the self-play example, these values will be unsquashed before being sent to the environment, because normalize_actions defaults to True.

config = (
        PPOConfig()
        # ... (rest of the self-play example config)
        # normalize_actions=True by default
)

This means that the random sample, which was already in the action space of the environment, is unsquashed (and possibly clipped) unnecessarily. My action space is [0, 1], so after unsquashing, the only actions the random agent can supply lie in [0.5, 1], which is far from ideal.
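To make this concrete, unsquashing is (as I understand it) an affine map from the normalized [-1, 1] range onto the env's [low, high] range. A minimal sketch of that formula (not the actual RLlib implementation) shows what happens to samples that are already in [0, 1]:

```python
def unsquash(a, low=0.0, high=1.0):
    # Affine map from the normalized range [-1, 1] onto [low, high],
    # mirroring what normalize_actions=True applies before env.step().
    return low + (a + 1.0) / 2.0 * (high - low)

# A random sample already drawn from the env's [0, 1] space...
print(unsquash(0.0))  # -> 0.5
print(unsquash(1.0))  # -> 1.0
# ...so the random agent's [0, 1] samples collapse onto [0.5, 1].
```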

If I set normalize_actions to False, that fixes the random agent but breaks the trained one. The non-random agent outputs actions in [-1, 1], and without the unsquashing step these are often flagged as illegal by my environment. Ideally they would be unsquashed to [0, 1].

What is the best way to solve this? Two options would be writing my own random policy, or having my environment unsquash actions itself, but neither seems clean. Am I missing anything?
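For concreteness, the custom-random-policy workaround I have in mind would sample in the normalized [-1, 1] range instead of the env's action space, so that the unsquashing step maps the samples back onto the full [0, 1] space. A standalone sketch of the idea (plain numpy, not importing RLlib, and assuming unsquashing is the affine [-1, 1] -> [low, high] map):

```python
import numpy as np

def normalized_random_actions(obs_batch_size):
    # Sample in the *normalized* [-1, 1] range that unsquashing expects,
    # instead of sampling the env's own [0, 1] action space.
    return np.random.uniform(-1.0, 1.0, size=obs_batch_size)

def unsquash(a, low=0.0, high=1.0):
    # Affine map [-1, 1] -> [low, high], as normalize_actions=True applies.
    return low + (a + 1.0) / 2.0 * (high - low)

# After unsquashing, the random actions cover the full [0, 1] env space
# rather than collapsing onto [0.5, 1].
actions = unsquash(normalized_random_actions(1000))
assert actions.min() >= 0.0 and actions.max() <= 1.0
```

This keeps normalize_actions=True for the trained agent, but it hard-codes an assumption about RLlib's normalized range into the random policy, which is why it doesn't feel clean.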