where the subtracted term is an estimate of the average reward for policy $\pi$, independent of the starting state $S_0$.
Everything else stays exactly the same.
In my understanding, the simplest way to implement this is to re-compute all rewards after rollout collection by subtracting the average of the collected rewards. In other words, I would like to perform the following:
1. perform rollouts (e.g., compute $\pi(a|s)$, $v(s)$, and call `env.step()`)
2. concat batches
3. compute average reward (e.g., $\bar{r} = \frac{1}{N}\sum_{t} r_t$)
4. re-compute rewards (e.g., $r_t \leftarrow r_t - \bar{r}$)
5. compute GAE as usual
6. backprop as usual
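The intended pipeline (steps 2–5) can be sketched in plain NumPy. This is a toy illustration, not RLlib code: `compute_gae` below is a textbook GAE implementation that ignores episode boundaries, and the worker batches are made-up data.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Textbook GAE over a single trajectory (episode boundaries ignored)."""
    values_ext = np.append(values, last_value)
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Toy per-worker batches standing in for step 1's rollouts.
worker_rewards = [np.array([1.0, 2.0]), np.array([3.0, 2.0])]
worker_values = [np.array([0.5, 0.4]), np.array([0.6, 0.5])]

all_rewards = np.concatenate(worker_rewards)   # step 2: concat batches
avg_reward = all_rewards.mean()                # step 3: average reward
centered = all_rewards - avg_reward            # step 4: re-compute rewards
advantages = compute_gae(centered,             # step 5: GAE on centered rewards
                         np.concatenate(worker_values), last_value=0.0)
```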
My issue is that, by construction, step 5 occurs before step 2, and I don’t see any way to reverse them. How can I implement such an algorithm with RLlib? Is there any way to override PPO’s default behavior?
Before passing the training batch to the loss function, you have to postprocess it. I’m pretty sure you will have value-function predictions in the batch because they are computed in the policy code for both the PPO Torch and TensorFlow policies. (There you can add a new key to the train-batch dict.)
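A minimal sketch of that idea, using a plain dict as a stand-in for RLlib's `SampleBatch` (the `"rewards"` key mirrors RLlib's field name, but `add_centered_rewards` and the `"centered_rewards"` key are hypothetical names, not RLlib API):

```python
import numpy as np

def add_centered_rewards(train_batch):
    # Hypothetical postprocessing step: add a new key holding the
    # mean-subtracted rewards before the batch reaches the loss function.
    rewards = np.asarray(train_batch["rewards"], dtype=np.float64)
    train_batch["centered_rewards"] = rewards - rewards.mean()
    return train_batch

batch = {"rewards": [1.0, 2.0, 3.0]}
batch = add_centered_rewards(batch)
```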
As I explained in the question, the problem with this strategy is that not all data is provided to postprocess_fn. The SampleBatch input to postprocess_fn contains only the data from the current worker. I would like to have access to the data from all rollout workers to compute the average reward, before computing GAE.
Well, this option did cross my mind, and indeed it seems feasible. The only thing I don’t like about it is that GAE will be performed twice (before applying the average reward and then a second time after it).
I was reading the code, and I believe that the only way to work around this double computation is to override the postprocess_fn as well (so it does not do any computation on the first pass over the data) and then compute GAE later, after batch concatenation.
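A rough sketch of that workaround, with plain dicts standing in for RLlib's `SampleBatch` (the `"rewards"`/`"vf_preds"` keys mimic RLlib's field names, but the functions are hypothetical; episode boundaries and bootstrap values are ignored for brevity):

```python
import numpy as np

def noop_postprocess(batch):
    # First pass (per worker): skip GAE entirely and return the batch as-is.
    return batch

def deferred_centered_gae(batches, gamma=0.99, lam=0.95):
    # Second pass (after concatenation): subtract the global average reward,
    # then compute GAE exactly once.
    rewards = np.concatenate([b["rewards"] for b in batches])
    values = np.concatenate([b["vf_preds"] for b in batches])
    rewards = rewards - rewards.mean()
    values_ext = np.append(values, 0.0)  # assume terminal value 0 for brevity
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    return advantages

batches = [noop_postprocess({"rewards": np.array([1.0, 3.0]),
                             "vf_preds": np.array([0.5, 0.5])}),
           noop_postprocess({"rewards": np.array([2.0, 2.0]),
                             "vf_preds": np.array([0.5, 0.5])})]
adv = deferred_centered_gae(batches)
```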