PPO order of actions/obs/rewards scrambled

varmichelle · January 15, 2022, 10:36pm

Hello, I noticed that the order of actions, observations, and rewards when training with PPO seems to be scrambled (while PG for example is fine).

Specifically I created a custom single-agent env: it is a two-step game where the initial observation (upon calling reset) is 10, and observations on steps where an action was taken just mirror the action. The reward is always 0 on the first time step and is equal to the action value on the second step. The action space is Discrete(10) and the observation space is Discrete(11).

In the rllib/agents/ppo/ppo_tf_policy.py file I am printing train_batch[SampleBatch.OBS], train_batch[SampleBatch.ACTIONS], and train_batch[SampleBatch.REWARDS] at the very beginning of ppo_surrogate_loss.

I expect every other element of the observations (the first step of each episode) to be 10, and the rewards to line up with the actions and observations, but the printed values are out of order.

Is this expected? If so, how do I get PPO to keep rollouts in their original order? I have set "batch_mode": "complete_episodes" (the train batch and SGD minibatch sizes are even numbers, so there shouldn’t be any “rollover”) and "shuffle_sequences": False (I am not sure what this parameter does), but that didn’t fix the problem.

Here is a minimal reproducible script for reference. Thank you!

import os
from gym.spaces import Discrete
import numpy as np 
import gym
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.agents.pg import PGTrainer
from ray.tune.logger import pretty_print
import ray


class TwoStepGame(gym.Env):
    """
    Simple two-step game:
    - init obs: 10 (signal game start)
    - obs: whatever action was taken in that round
    - reward: 0 on first step, action value on second step
    """

    def __init__(self, env_config):
        self.action_space = Discrete(10)
        self.observation_space = Discrete(11)

    def reset(self):
        self.t = 0
        return 10

    def step(self, action):
        self.t += 1
        done = self.t == 2
        info = {}
        obs = action
        reward = action if self.t == 2 else 0
        return obs, reward, done, info


if __name__ == "__main__":
    ray.init(local_mode=True)
    config = {
        "env": TwoStepGame,
        "env_config": {},
        "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
        "framework": 'tfe',
        "num_workers": 1,
        # "rollout_fragment_length": 10,
        "train_batch_size": 10,
        "sgd_minibatch_size": 10,
        "num_sgd_iter": 1,
        "batch_mode": "complete_episodes",
        "shuffle_sequences": False,
        'seed': 0,
    }
    agent = PPOTrainer(config, TwoStepGame)
    for i in range(10):
        result = agent.train()
        print(pretty_print(result))

mannyv · January 15, 2022, 11:02pm

Hi @varmichelle,

Welcome to the forum.

Yes, this is expected behavior. The rows are shuffled with respect to the order of timesteps, but the transition values (obs, action, reward, …) should be shuffled together consistently, so each row still describes a single timestep. If you find they are not, please do let us know.
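For the two-step game in the question, one way to check that transitions stay intact is a per-row invariant that holds under any row ordering: a row with obs 10 must have reward 0, and any other row must have reward equal to its action. The sketch below is illustrative only; rows_are_consistent is a hypothetical helper, and it assumes the observations are available as plain integers rather than in whatever encoding the policy receives.

```python
import numpy as np


def rows_are_consistent(obs, actions, rewards, start_obs=10):
    """Per-row check for the two-step game (hypothetical helper).

    First-step rows (obs == start_obs) must have reward 0;
    second-step rows must have reward == action.
    The check does not depend on the order of the rows.
    """
    obs, actions, rewards = map(np.asarray, (obs, actions, rewards))
    first_step = obs == start_obs
    expected = np.where(first_step, 0, actions)
    return bool(np.all(rewards == expected))


# Two episodes in rollout order, one (obs, action, reward) row per timestep.
obs = np.array([10, 3, 10, 7])
actions = np.array([3, 5, 7, 2])
rewards = np.array([0, 5, 0, 2])

# The same permutation applied to every column keeps rows intact.
perm = np.random.default_rng(0).permutation(len(obs))
print(rows_are_consistent(obs[perm], actions[perm], rewards[perm]))  # True

# Reordering only one column breaks the pairing.
print(rows_are_consistent(obs, actions[::-1], rewards))  # False
```

If the check fails on a real train batch, the columns were reordered independently, which would be a bug rather than the expected shuffling.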

The shuffling happens here:

https://github.com/ray-project/ray/blob/4a55d10bb1b70971f50a3872421f2c1eebd84e64/rllib/utils/sgd.py#L55

    if isinstance(samples, MultiAgentBatch):
        raise NotImplementedError(
            "Minibatching not implemented for multi-agent in simple mode")

    if "state_in_0" not in samples and "state_out_0" not in samples:
        samples.shuffle()

    all_slices = samples._get_slice_indices(sgd_minibatch_size)
    data_slices, state_slices = all_slices

    if len(state_slices) == 0:
        if shuffle:
            random.shuffle(data_slices)
        for i, j in data_slices:
            yield samples.slice(i, j)
    else:
        all_slices = list(zip(data_slices, state_slices))
        if shuffle:
            # Make sure to shuffle data and states while linked together.
            random.shuffle(all_slices)
        for (i, j), (si, sj) in all_slices:

That function has a shuffle argument that defaults to True. If you look at where it is called from, you will find that the argument is not passed in, so it is always True.

https://github.com/ray-project/ray/blob/4a55d10bb1b70971f50a3872421f2c1eebd84e64/rllib/utils/sgd.py#L114

        # than max_seq_len otherwise this will cause indexing errors while
        # performing sgd when using a RNN or Attention model
        if policy.is_recurrent() and \
           policy.config["model"]["max_seq_len"] > sgd_minibatch_size:
            raise ValueError("`sgd_minibatch_size` ({}) cannot be smaller than"
                             "`max_seq_len` ({}).".format(
                                 sgd_minibatch_size,
                                 policy.config["model"]["max_seq_len"]))

        for i in range(num_sgd_iter):
            for minibatch in minibatches(batch, sgd_minibatch_size):
                results = (local_worker.learn_on_batch(
                    MultiAgentBatch({
                        policy_id: minibatch
                    }, minibatch.count)))[policy_id]
                learner_info_builder.add_learn_on_batch_results(
                    results, policy_id)

    learner_info = learner_info_builder.finalize()
    return learner_info

@sven1977 or @gjoliver can comment on whether that is intentional or an oversight.

In the meantime, you could edit that file manually to pass shuffle=False if you really need to disable it.
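To make the effect of such a flag concrete, here is a minimal standalone sketch of a minibatch generator with a shuffle switch. It is not RLlib's implementation: minibatches_sketch, its seed parameter, and the plain dict-of-arrays batch are assumptions made for illustration only.

```python
import numpy as np


def minibatches_sketch(batch, minibatch_size, shuffle=True, seed=None):
    """Yield minibatches from a dict of equal-length arrays.

    Hypothetical stand-in for illustration; not RLlib's code.
    """
    n = len(next(iter(batch.values())))
    rng = np.random.default_rng(seed)
    # A single permutation is applied to every column, so each
    # (obs, action, reward) row stays intact even when shuffled.
    order = rng.permutation(n) if shuffle else np.arange(n)
    for start in range(0, n, minibatch_size):
        idx = order[start:start + minibatch_size]
        yield {key: np.asarray(col)[idx] for key, col in batch.items()}


# Two episodes of the two-step game in rollout order.
batch = {
    "obs": np.array([10, 3, 10, 7]),
    "actions": np.array([3, 5, 7, 2]),
    "rewards": np.array([0, 5, 0, 2]),
}

# With shuffle=False the rollout order is preserved.
ordered = list(minibatches_sketch(batch, 2, shuffle=False))
print(ordered[0]["obs"].tolist())  # [10, 3]
```

One caveat when applying this to the file quoted above: in that snippet, samples.shuffle() appears to run whenever the batch has no state_in_0 or state_out_0 keys, independently of the shuffle argument, so that call may also need to be skipped to keep the original order.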
