Hello, I noticed that the order of actions, observations, and rewards seems to be scrambled when training with PPO (while PG, for example, is fine).
Specifically, I created a custom single-agent env: a two-step game where the initial observation (returned by reset) is 10, and the observation on each step where an action was taken simply mirrors that action. The reward is 0 on the first time step and equals the action value on the second step. The action space is Discrete(10) and the observation space is Discrete(11).
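For reference, stepping this env by hand gives the following (a quick sketch using the TwoStepGame class from the script below; the actions 3 and 7 are just arbitrary examples):

env = TwoStepGame({})
obs = env.reset()                  # obs == 10 (episode start signal)
obs, rew, done, _ = env.step(3)    # obs == 3, rew == 0, done == False
obs, rew, done, _ = env.step(7)    # obs == 7, rew == 7, done == True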
In rllib/agents/ppo/ppo_tf_policy.py I am printing train_batch[SampleBatch.OBS], train_batch[SampleBatch.ACTIONS], and train_batch[SampleBatch.REWARDS] at the very beginning of ppo_surrogate_loss.
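Concretely, my debug prints look roughly like this (just a sketch of where I added them; in the Ray version I'm on, the loss function in ppo_tf_policy.py has this signature):

def ppo_surrogate_loss(policy, model, dist_class, train_batch):
    # debug: inspect the ordering of the rollout data in the train batch
    print(train_batch[SampleBatch.OBS])
    print(train_batch[SampleBatch.ACTIONS])
    print(train_batch[SampleBatch.REWARDS])
    ...  # rest of the original loss computation is unchanged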
I expect every other observation (the first one of each two-step episode) to be 10 and the rewards to match up with the corresponding actions and observations, but the printed batches appear scrambled.
Is this expected? If so, how do I get PPO to maintain the correct order in rollouts? I have set "batch_mode": "complete_episodes" (and the train batch and SGD minibatch sizes are even numbers, so there shouldn't be any "rollover") and "shuffle_sequences": False (I'm not sure what this parameter does), but that didn't seem to fix the problem.
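In other words, with complete episodes of length two, I would expect checks along these lines to pass on the (unshuffled) train batch, where obs, actions, and rewards are the arrays printed above (sketch only):

# the first observation of every two-step episode is the reset signal
assert all(o == 10 for o in obs[0::2])
# first-step rewards are 0; second-step rewards equal the second action
assert all(r == 0 for r in rewards[0::2])
assert all(r == a for r, a in zip(rewards[1::2], actions[1::2]))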
Here is a minimal reproducible script for reference. Thank you!
import os

import gym
from gym.spaces import Discrete
import ray
from ray.rllib.agents.pg import PGTrainer  # PG can be swapped in for comparison; its batches come out in order
from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.logger import pretty_print


class TwoStepGame(gym.Env):
    """Simple two-step game:

    - init obs: 10 (signals game start)
    - obs: whatever action was taken in that round
    - reward: 0 on the first step, the action value on the second step
    """

    def __init__(self, env_config):
        self.action_space = Discrete(10)
        self.observation_space = Discrete(11)
        self.t = 0

    def reset(self):
        self.t = 0
        return 10

    def step(self, action):
        self.t += 1
        done = self.t == 2
        info = {}
        obs = action
        reward = action if self.t == 2 else 0
        return obs, reward, done, info


if __name__ == "__main__":
    ray.init(local_mode=True)
    config = {
        "env": TwoStepGame,
        "env_config": {},
        "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
        "framework": "tfe",
        "num_workers": 1,
        # "rollout_fragment_length": 10,
        "train_batch_size": 10,
        "sgd_minibatch_size": 10,
        "num_sgd_iter": 1,
        "batch_mode": "complete_episodes",
        "shuffle_sequences": False,
        "seed": 0,
    }
    agent = PPOTrainer(config, TwoStepGame)
    for i in range(10):
        result = agent.train()
        print(pretty_print(result))