Self-play modifications via callbacks

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

So, I am working on an idea involving self-play setups where the main policy’s reward is computed in a non-standard way. I’ll explain fully below.

Let’s say we are training an agent to play a two-player game such as Connect Four. Usually, we just train the main policy to maximize its score against randomly sampled snapshots of itself throughout training, and this works well. However, if the game has a dominant strategy, that is likely what the main policy will learn (e.g., just running away in a single direction in 1v1 hide-and-seek).

Let’s say that instead, at each training step I want to sample k snapshots from the archive and play each of them against the main policy. At the end of each rollout, this would generate a k-dimensional (cumulative) reward vector r_t' = [r_1, ..., r_k], where r_1 is the reward in the 1v1 match between hider_1 and seeker_1, r_2 the reward between hider_1 and seeker_2, etc. Then I could compute the distance between this reward vector and the ones saved in a buffer from previous training steps, e.g., r_t = ||r_t' - r'_{0:t-1}||. From there, I would run e.g. PPO as usual, using r_t as the standard (sparse) reward sample, even though it comes from several rollouts’ worth of interactions.
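In rough numpy terms, I mean something like the following (just a sketch; here I read the distance as the minimum distance to the buffered vectors, and the names are made up):

```python
import numpy as np


def distance_reward(r_t_prime, reward_buffer):
    """Collapse a k-dim vector of per-matchup returns into a scalar reward.

    r_t_prime: np.ndarray of shape (k,), cumulative reward of `main` vs. each
        sampled snapshot at this training step.
    reward_buffer: list of such vectors saved from previous training steps.
    """
    if not reward_buffer:
        return 0.0
    # Distance of the current outcome profile to each previously seen one;
    # take the minimum, i.e. "how different is this from anything so far".
    dists = [np.linalg.norm(r_t_prime - r_prev) for r_prev in reward_buffer]
    return float(min(dists))
```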

How would I do that? I’ve been looking at the self-play code and it doesn’t seem very amenable to that change and for the life of me, I can’t figure it out at the moment.

It seems like I need an evaluation call of the main policy vs. each of the k sampled policies, but I want this to be differentiable, which tells me I don’t want it to live in the evaluation loop.

If I could get some pointers that’d be greatly appreciated.

After some digging, it seems like the correct thing for me to do is to override the on_postprocess_trajectory callback. Experimenting with that now and will follow up.
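For reference, the wiring I’m experimenting with looks roughly like this (just a sketch; the callbacks class name and env id are placeholders):

```python
from ray.rllib.algorithms.callbacks import DefaultCallbacks
from ray.rllib.algorithms.ppo import PPOConfig


class SelfPlayRewardCallbacks(DefaultCallbacks):
    """Will override on_postprocess_trajectory (body sketched in the next post)."""


config = (
    PPOConfig()
    .environment("my_selfplay_env")  # placeholder env id
    .callbacks(SelfPlayRewardCallbacks)
)
```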

Okay, so I’m really close!

I am working on a self-play algorithm; say I want to run 5 rollouts of my main policy versus 5 randomly chosen snapshots from my archive. After those rollouts all finish, where could I intercept the rollout data for all 5 matches? I was experimenting with on_postprocess_trajectory, but it is called at the end of every single rollout, so I don’t have access to all 5 at once like I need. However, in on_postprocess_trajectory I do have access to postprocessed_batch, which I would like to use for some reward shaping based on the 5 rollouts with different samples from the self-play archive. The callback’s docstring says:

> do additional policy postprocessing for a policy including looking at the trajectory data of other agents in multi-agent settings

However, I’m only seeing some of the necessary data in there, not all of it, because the callback only has one episode, and my multi-agent data comes from different agents playing different episodes (since it’s self-play, not e.g. 4 agents all in the same env). I was then looking at on_learn_on_batch, but everything in there is shuffled and contains only data from the agent being trained, so that’s not quite working for me either.

def on_postprocess_trajectory(
    self, *, worker, episode, agent_id, policy_id, policies,
    postprocessed_batch, original_batches, **kwargs
):
    # get reward info for all k rollouts with different snapshots vs main
    rewards = [episode.agent_rewards[k] for k in episode.agent_rewards.keys()]
    # usually this list is only 1 element long because this callback only
    # sees a single rollout/episode
    # do fancy reward thing using all k rollouts
    reward = ...
    # change rewards for the `main` agent: zero everything out and put the
    # shaped (sparse) reward on the final timestep
    postprocessed_batch["rewards"] = np.zeros_like(postprocessed_batch["rewards"])
    postprocessed_batch["rewards"][-1] = reward

Hi @aadharna,

I took a quick look at the source, and I think you are going to have trouble getting access to the complete training batch before shuffling. on_learn_on_batch seems to be called in the SGD code after the timesteps have been permuted. I didn’t step through it, just did a static read of the code, so I could have missed something.

I think the most straightforward way to do this is to create your own subclass of PPO that overrides training_step. Luckily, this is easy to do since Ray 2.2.

You would want to modify the existing training_step to add your logic where train_batch is assembled.
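Something along these lines, as a rough, untested and stripped-down sketch of 2.2’s training_step (the `_shaped_rewards` helper and the `"main"` policy id are placeholders you would swap for your own):

```python
from ray.rllib.algorithms.ppo import PPO
from ray.rllib.execution.rollout_ops import (
    standardize_fields,
    synchronous_parallel_sample,
)
from ray.rllib.execution.train_ops import train_one_step


class DistanceRewardPPO(PPO):
    def training_step(self):
        # Collect rollouts from all workers; with k snapshot opponents this
        # MultiAgentBatch holds the episodes from every matchup.
        train_batch = synchronous_parallel_sample(
            worker_set=self.workers,
            max_env_steps=self.config["train_batch_size"],
        ).as_multi_agent()

        # Custom hook (placeholder): compute the distance-based reward from the
        # per-matchup returns and overwrite the `main` policy's reward column.
        main_batch = train_batch.policy_batches["main"]
        main_batch["rewards"] = self._shaped_rewards(main_batch)  # user-defined

        # Hand the shaped batch to the usual PPO update (simple-optimizer path).
        train_batch = standardize_fields(train_batch, ["advantages"])
        return train_one_step(self, train_batch)
```

One caveat: advantages are computed on the workers from the original rewards during trajectory postprocessing, so if you overwrite rewards at this point you would also need to recompute advantages (or do your shaping earlier).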

Thanks for the pointer @mannyv! That’s the realization I came to last night as well. I’ll start poking around there today.

I’ll leave this gist here as well since it’s cleaner and has all the files necessary (I’m pretty sure) for a full look at what I’m doing: rllib self play adaptations · GitHub