Training a custom agent against a random policy

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I’m working on implementing 2-player self-play with PPO as in this example. Initially, the agent being trained plays against RandomPolicy until it wins enough times. It looks like RandomPolicy samples directly from my custom environment’s action space. For reference, here is the return statement from RLlib’s RandomPolicy compute_actions():

return (
    [self.action_space_for_sampling.sample() for _ in range(obs_batch_size)],
    [],
    {},
)

If I use the config from the self-play example, these values are unsquashed before being sent to the environment, because normalize_actions defaults to True:

config = (
    PPOConfig()
    .environment(
        env="open_spiel_env",
        # normalize_actions=True by default
    )
    ...
)

This means the random sample, which is already in the environment’s action space, gets unsquashed (and possibly clipped) unnecessarily. My action space is [0, 1], so after unsquashing, the actions supplied by the random agent can only land in [0.5, 1], which is far from ideal.
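To make that concrete, here is a rough sketch of the mapping I believe the unsquashing step applies to a Box space (the unsquash helper below is my own illustration, not RLlib’s actual implementation):

import numpy as np

low, high = 0.0, 1.0  # bounds of my env's Box action space

def unsquash(a):
    # Treat `a` as if it were in [-1, 1] and rescale it into [low, high],
    # which is what I understand the unsquashing step (plus clipping) to do.
    return low + (np.clip(a, -1.0, 1.0) + 1.0) * (high - low) / 2.0

# RandomPolicy's samples are already in the env space, i.e. in [0, 1]:
samples = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
print(unsquash(samples))  # -> [0.5, 0.625, 0.75, 0.875, 1.0], squeezed into [0.5, 1]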

Setting normalize_actions to False fixes the random policy but breaks the trained agent: it outputs actions in [-1, 1], and without the unsquashing step these often get flagged by my environment as illegal. Ideally they would be unsquashed to [0, 1].
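For reference, turning normalization off is just the flipped version of the config above:

config = (
    PPOConfig()
    .environment(
        env="open_spiel_env",
        normalize_actions=False,
    )
    ...
)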

What is the best way to solve this? A couple of options could be writing my own random policy (rough sketch below) or changing my environment to unsquash actions itself, but neither of these seems clean. Am I missing anything?
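For the first option, this is roughly what I had in mind: a random policy that samples in the squashed [-1, 1] range so that RLlib’s own unsquashing maps it back onto the env’s [0, 1] space. This is only a sketch under my assumptions: the import path is the one used in the self-play example, NormalizedRandomPolicy is just my name, the action space is assumed to be a Box, and the batch-size computation is simplified compared to the built-in RandomPolicy.

import numpy as np

from ray.rllib.examples.policy.random_policy import RandomPolicy  # path from the self-play example


class NormalizedRandomPolicy(RandomPolicy):
    """Random policy that samples in the normalized [-1, 1] range.

    Since normalize_actions=True makes RLlib unsquash whatever the policy
    returns, sampling in [-1, 1] here should mean the unsquashed actions
    cover the env's full [0, 1] action space.
    """

    def compute_actions(
        self,
        obs_batch,
        state_batches=None,
        prev_action_batch=None,
        prev_reward_batch=None,
        **kwargs,
    ):
        # Simplified batch-size computation (the built-in policy does this
        # slightly differently via tree.flatten).
        obs_batch_size = len(obs_batch)
        # Assumes a Box action space, so .shape is well-defined.
        shape = self.action_space_for_sampling.shape
        return (
            [np.random.uniform(-1.0, 1.0, size=shape) for _ in range(obs_batch_size)],
            [],
            {},
        )

It works in my head, but it feels like I’m duplicating knowledge about the squashing convention inside the policy, which is why I’m asking whether there is a more idiomatic way.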