PPO order of actions/obs/rewards scrambled

Hello, I noticed that the order of actions, observations, and rewards when training with PPO seems to be scrambled (while PG, for example, is fine).

Specifically, I created a custom single-agent env: a two-step game where the initial observation (returned by reset) is 10, and the observation after each step simply mirrors the action taken. The reward is 0 on the first step and equal to the action value on the second step. The action space is Discrete(10) and the observation space is Discrete(11).

In the rllib/agents/ppo/ppo_tf_policy.py file I am printing train_batch[SampleBatch.OBS], train_batch[SampleBatch.ACTIONS], and train_batch[SampleBatch.REWARDS] at the very beginning of ppo_surrogate_loss.

I expect every other element of the observations (the episode-start observations) to be 10, and the rewards to match up with the actions and observations.

Is this expected? If so, how do I get PPO to maintain the correct order in rollouts? I have set "batch_mode": "complete_episodes" (and the train batch and SGD minibatch sizes are even, so there shouldn't be any "rollover") and "shuffle_sequences": False (I'm not sure what this parameter does), but that didn't seem to fix the problem.

Here is a minimal reproducible script for reference. Thank you!

import os
from gym.spaces import Discrete
import gym
from ray.rllib.agents.ppo import PPOTrainer

class TwoStepGame(gym.Env):
    """Simple two-step game:

    - init obs: 10 (signals game start)
    - obs: whatever action was taken that step
    - reward: 0 on the first step, the action value on the second step
    """

    def __init__(self, env_config):
        self.action_space = Discrete(10)
        self.observation_space = Discrete(11)

    def reset(self):
        self.t = 0
        return 10

    def step(self, action):
        self.t += 1
        done = self.t == 2
        info = {}
        obs = action
        reward = action if self.t == 2 else 0
        return obs, reward, done, info

if __name__ == "__main__":
    config = {
        "env": TwoStepGame,
        "env_config": {},
        "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
        "framework": 'tfe',
        "num_workers": 1,
        # "rollout_fragment_length": 10,
        "train_batch_size": 10,
        "sgd_minibatch_size": 10,
        "num_sgd_iter": 1,
        "batch_mode": "complete_episodes",
        "shuffle_sequences": False,
        "seed": 0,
    }
    agent = PPOTrainer(config, TwoStepGame)
    for i in range(10):
        result = agent.train()

Hi @varmichelle,

Welcome to the forum.

Yes, this is expected behavior. The timesteps are shuffled with respect to their order, but the transition values (obs, action, reward, …) should be consistently shuffled together, i.e. each (obs, action, reward) triple stays intact. If you find that they are not, please do let us know.
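As an illustrative sketch (plain NumPy, not RLlib's actual code path), "consistently shuffled together" means one permutation is applied to every column of the batch, so each row's fields still belong to the same transition even though timestep order is lost:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch from your env: two episodes of length 2.
# Episode starts have obs == 10; other obs mirror the previous action.
batch = {
    "obs":     np.array([10, 3, 10, 7]),
    "actions": np.array([3, 5, 7, 2]),
    "rewards": np.array([0, 5, 0, 2]),
}

# One shared permutation reorders every column identically.
perm = rng.permutation(len(batch["obs"]))
shuffled = {k: v[perm] for k, v in batch.items()}

# Timestep order is gone, but each (obs, action, reward) triple is intact:
for i, j in enumerate(perm):
    assert shuffled["obs"][i] == batch["obs"][j]
    assert shuffled["actions"][i] == batch["actions"][j]
    assert shuffled["rewards"][i] == batch["rewards"][j]
```

If instead each column were permuted independently, the triples would be torn apart, which is the failure mode you would want to report.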

The shuffling happens here:

That function has a shuffle argument with a default of True. But if you look at where it is called from, you will find that the argument is not passed in, so it will always be true.

@sven1977 or @gjoliver can comment on whether that is intentional or an oversight.

In the meantime, you could manually edit that file to pass shuffle=False if you really need to disable it.