When are MARL replay buffers zero padded?

This post is related to this previous question. The gist is that during training, my custom model was given empty batches (the first dimension of obs is 0). I found that this happens when the data arrays are shorter than the slice computed in the _slice method of SampleBatch (e.g., the data arrays have length 500 but the slice is 550:600).
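For readers who want to see why an out-of-range slice produces an empty batch rather than an exception, here is a minimal NumPy-only sketch. The array shape and slice bounds are illustrative stand-ins, not values taken from SampleBatch itself.

```python
import numpy as np

# Stand-in for one column of a batch: 500 timesteps of 4-dim observations.
obs = np.zeros((500, 4), dtype=np.float32)

# A slice that lies entirely past the end of the array does not raise;
# NumPy silently returns an array whose first dimension is 0.
out_of_range = obs[550:600]
print(out_of_range.shape)  # (0, 4)

# A slice that only partly overlaps the array is silently truncated.
partial = obs[480:530]
print(partial.shape)  # (20, 4)
```

Both cases match the symptoms described: either an empty batch or a batch shorter than the requested length.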

I found that this is in turn caused by the model’s zero_padded attribute being true, but the data arrays aren’t actually padded, resulting in out-of-range access and thus returning empty batches or batches with different sizes than specified by seq_len. I added these print statements before the call to tree.map_structure_with_path(map_, self):


And one of the print results immediately before my model’s error is this:

(PPO pid=45124) 6000
(PPO pid=45124) 5900
(PPO pid=45124) 6020
(PPO pid=45124) True
(PPO pid=45124) (6000, 210, 160, 3)

These lines show the discrepancy: len(self) returns the sum of self['seq_len'], so it should be smaller than the length of the padded data array (I’ve manually checked that seq_len contains sequences of different lengths). However, len(self[SampleBatch.OBS][0]) = 6000 = len(self).
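To make the expected relationship concrete, here is a small sketch of right zero-padding. The function name right_zero_pad_sketch and the toy data are hypothetical and this is not RLlib's right_zero_pad; it only illustrates that a padded array has len(seq_lens) * max_seq_len rows, which is at least the sum of the sequence lengths.

```python
import numpy as np

def right_zero_pad_sketch(data, seq_lens, max_seq_len):
    """Pad each variable-length sequence in `data` with zeros on the right."""
    padded = np.zeros(
        (len(seq_lens) * max_seq_len,) + data.shape[1:], dtype=data.dtype)
    src = 0
    for i, n in enumerate(seq_lens):
        dst = i * max_seq_len
        # Copy the i-th sequence into its fixed-size slot; the rest stays 0.
        padded[dst:dst + n] = data[src:src + n]
        src += n
    return padded

seq_lens = [3, 1, 2]   # three sequences, 6 real timesteps in total
max_seq_len = 3
data = np.arange(1, 7, dtype=np.float32).reshape(6, 1)

padded = right_zero_pad_sketch(data, seq_lens, max_seq_len)
print(len(data), len(padded))  # 6 9
print(padded.ravel())          # [1. 2. 3. 4. 0. 0. 5. 6. 0.]
```

If a batch is flagged as zero-padded but its arrays still have sum(seq_lens) rows, slices computed for the padded layout will run past the end of the data.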

I’m trying to find where the zero-padding actually happened, and I found that the right_zero_pad function in sample_batch.py isn’t actually called. Am I missing something important/obvious? Is this the correct direction to debug this problem?


I use tuple observation spaces in some of my environments without problems, so I would first suspect that something is off with either the custom env or the model. That said, you may have found a bug in RLlib. Too soon to tell.

Here is what I would do to test.

First I would use the RandomEnv environment to set up an environment with the desired Observation and Action spaces. Then I would use the built-in models to test and see if an error occurs. If it does let us know and we can try and track down the issues.

If that works fine, then I would do two things.

1.) Test the custom env with the built-in models.
If this has an issue then you would suspect your env is not behaving as expected.

2.) Use the random env with the custom model.
If this has an issue then your custom model probably needs some tweaking.

If you have a reproduction script feel free to share it.

@mannyv Thanks for the prompt reply! I think your suggestion is more systematic, but I decided it was too hard to replicate the behavior of my custom env. I did find a snippet of code in RLlib that looks very suspicious and is likely the cause of my problem:

# RNN, attention net, or multi-agent case.
state_keys = []
feature_keys_ = feature_keys or []
for k, v in batch.items():
    if k.startswith("state_in_"):
        state_keys.append(k)
    elif not feature_keys and not k.startswith("state_out_") and \
            k not in ["infos", SampleBatch.SEQ_LENS] and \
            isinstance(v, np.ndarray):
        feature_keys_.append(k)

These lines are around line 105 in rnn_sequencing.py. This loop selects which dict keys get zero-padded: only the features whose keys are in feature_keys_ are padded. My observation space is a tuple, so it is not added to the feature_keys_ list and is therefore not padded.
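The selection behavior can be reproduced in isolation with a plain dict. This is a sketch under assumptions: select_keys is a hypothetical helper, the string "seq_lens" stands in for SampleBatch.SEQ_LENS, and the two append calls are my reading of what each branch does with a matching key.

```python
import numpy as np

SEQ_LENS = "seq_lens"  # stand-in for SampleBatch.SEQ_LENS

def select_keys(batch, feature_keys=None):
    """Return (feature keys to pad, state keys), mirroring the loop above."""
    state_keys = []
    feature_keys_ = list(feature_keys or [])
    for k, v in batch.items():
        if k.startswith("state_in_"):
            state_keys.append(k)  # assumed branch body
        elif (not feature_keys and not k.startswith("state_out_")
              and k not in ["infos", SEQ_LENS]
              and isinstance(v, np.ndarray)):
            feature_keys_.append(k)  # assumed branch body
    return feature_keys_, state_keys

batch = {
    # Unflattened tuple observation: a Python tuple, not an np.ndarray.
    "obs": (np.zeros((6, 5, 5)), np.zeros(6, dtype=np.int64)),
    "rewards": np.zeros(6),
    "state_in_0": np.zeros((2, 8)),
    "state_out_0": np.zeros((6, 8)),
    SEQ_LENS: np.array([3, 3]),
}

features, states = select_keys(batch)
print(features)  # ['rewards']
print(states)    # ['state_in_0']
```

The "obs" entry fails the isinstance(v, np.ndarray) check, so it never reaches the list of columns that get padded.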

Another potential problem with complex data is in how data are padded. In chop_into_sequences function in the same file, there are these lines:

feature_sequences = []
for f in feature_columns:
    # Save unnecessary copy.
    if not isinstance(f, np.ndarray):
        f = np.array(f)
    length = len(seq_lens) * max_seq_len
    if f.dtype == np.object or f.dtype.type is np.str_:
        f_pad = [None] * length

It’s easy to see that if a tuple feature is to be padded, it would have a dtype of np.object, and f_pad would therefore be created as a list padded with Nones. This will very likely produce errors when batched. A more general method (if RLlib plans to support padding complex/recursive spaces) might be to use tree and map_structure to pad recursively.
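A rough sketch of what recursive padding could look like follows. To stay dependency-free it uses a small hand-written helper, map_structure_sketch, as a stand-in for tree.map_structure; both that helper and pad_leaf are hypothetical names for illustration and are not part of RLlib.

```python
import numpy as np

def map_structure_sketch(fn, struct):
    """Apply `fn` to every leaf of a nested tuple/dict structure."""
    if isinstance(struct, tuple):
        return tuple(map_structure_sketch(fn, s) for s in struct)
    if isinstance(struct, dict):
        return {k: map_structure_sketch(fn, v) for k, v in struct.items()}
    return fn(struct)

def pad_leaf(f, seq_lens, max_seq_len):
    """Right-pad one array column with zeros, keeping its dtype."""
    f = np.asarray(f)
    padded = np.zeros(
        (len(seq_lens) * max_seq_len,) + f.shape[1:], dtype=f.dtype)
    src = 0
    for i, n in enumerate(seq_lens):
        dst = i * max_seq_len
        padded[dst:dst + n] = f[src:src + n]
        src += n
    return padded

seq_lens, max_seq_len = [2, 1], 2
# Tuple observation column: 3 timesteps of a (5, 5) box and a discrete value.
obs = (np.ones((3, 5, 5), dtype=np.float32), np.array([1, 0, 3]))

padded_obs = map_structure_sketch(
    lambda leaf: pad_leaf(leaf, seq_lens, max_seq_len), obs)
print([p.shape for p in padded_obs])  # [(4, 5, 5), (4,)]
print(padded_obs[1])                  # [1 0 3 0]
```

Each leaf keeps its own dtype and trailing shape, so nothing collapses into an object array filled with None.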

Hi @Aceticia,

My suggestion was not to reproduce the logic of your custom env but merely the interface between components.

What are your obs_space and action spaces?

These are the space definitions:

        # Set obs and act space
        self.action_space = spaces.Tuple(

        # Observation space is squished tuple.
        self.observation_space = spaces.Tuple(
             # Env obs
             # Previous action
             # Other party's actions
             # Other party's msgs
             # Whether supporting


Here is a colab that uses spaces similar to the one you posted with the random env. Episodes have an expected length of 20 with a probability of being done on any particular step of 5%.

It also has an lstm using the auto_lstm feature with a max_seq_len of 10.

I’m not sure why exactly, but this seems to work for tf but not torch. Here is a reproduction script:

import ray
import gym
from ray import tune

import numpy as np
from gym.spaces import Discrete, Tuple, Box

class RandomEnv(gym.Env):
    """A randomly acting environment.
    Can be instantiated with arbitrary action-, observation-, and reward
    spaces. Observations and rewards are generated by simply sampling from the
    observation/reward spaces. The probability of a `done=True` can be
    configured as well.
    """

    def __init__(self, config=None):
        config = config or {}

        # Action space.
        self.action_space = Tuple((Discrete(2), Discrete(4)))
        # Observation space from which to sample.
        self.observation_space = Tuple((Box(0, 1, (5,5)), Discrete(2), Discrete(4)))
        # Reward space from which to sample.
        self.reward_space = config.get(
            "reward_space",
            gym.spaces.Box(low=-1.0, high=1.0, shape=(), dtype=np.float32))
        # Chance that an episode ends at any step.
        self.p_done = config.get("p_done", 0.1)
        # A max episode length.
        self.max_episode_len = config.get("max_episode_len", None)
        # Whether to check action bounds.
        self.check_action_bounds = config.get("check_action_bounds", False)
        # Steps taken so far (after last reset).
        self.steps = 0

    def reset(self):
        self.steps = 0
        return self.observation_space.sample()

    def step(self, action):
        if self.check_action_bounds and not self.action_space.contains(action):
            raise ValueError("Illegal action for {}: {}".format(
                self.action_space, action))
        if (isinstance(self.action_space, Tuple)
                and len(action) != len(self.action_space.spaces)):
            raise ValueError("Illegal action for {}: {}".format(
                self.action_space, action))

        self.steps += 1
        done = False
        # We are done as per our max-episode-len.
        if self.max_episode_len is not None and \
                self.steps >= self.max_episode_len:
            done = True
        # Max not reached yet -> Sample done via p_done.
        elif self.p_done > 0.0:
            done = bool(
                np.random.choice(
                    [True, False], p=[self.p_done, 1.0 - self.p_done]))

        return self.observation_space.sample(), \
            float(self.reward_space.sample()), done, {}

if __name__ == '__main__':

    run_config = {
        "env": RandomEnv,
        "model": {
            "max_seq_len": 20,
            "use_lstm": True,
        },
        "framework": "torch",
        "num_gpus": 1,
        "num_workers": 0,
    }

    results = tune.run("PPO", config=run_config, verbose=1)

The error looks like this:

(pid=113923) 2021-10-11 21:47:54,766	WARNING deprecation.py:38 -- DeprecationWarning: `SampleBatch['is_training']` has been deprecated. Use `SampleBatch.is_training` instead. This will raise an error in the future!
(pid=113923) 2021-10-11 21:47:54,768	ERROR worker.py:425 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=113923, ip=
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 137, in __init__
(pid=113923)     Trainer.__init__(self, config, env, logger_creator)
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 622, in __init__
(pid=113923)     super().__init__(config, logger_creator)
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/tune/trainable.py", line 106, in __init__
(pid=113923)     self.setup(copy.deepcopy(self.config))
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 147, in setup
(pid=113923)     super().setup(config)
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 775, in setup
(pid=113923)     self._init(self.config, self.env_creator)
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 171, in _init
(pid=113923)     self.workers = self._make_workers(
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 857, in _make_workers
(pid=113923)     return WorkerSet(
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 110, in __init__
(pid=113923)     self._local_worker = self._make_worker(
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 406, in _make_worker
(pid=113923)     worker = cls(
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 584, in __init__
(pid=113923)     self._build_policy_map(
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1376, in _build_policy_map
(pid=113923)     self.policy_map.create_policy(name, orig_cls, obs_space, act_space,
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/policy/policy_map.py", line 143, in create_policy
(pid=113923)     self[policy_id] = class_(observation_space, action_space,
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/policy/policy_template.py", line 280, in __init__
(pid=113923)     self._initialize_loss_from_dummy_batch(
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/policy/policy.py", line 731, in _initialize_loss_from_dummy_batch
(pid=113923)     self.compute_actions_from_input_dict(
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 304, in compute_actions_from_input_dict
(pid=113923)     return self._compute_action_helper(input_dict, state_batches,
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
(pid=113923)     return func(self, *a, **k)
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 368, in _compute_action_helper
(pid=113923)     dist_inputs, state_out = self.model(input_dict, state_batches,
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/models/modelv2.py", line 231, in __call__
(pid=113923)     restored["obs"] = restore_original_dimensions(
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/models/modelv2.py", line 389, in restore_original_dimensions
(pid=113923)     return _unpack_obs(obs, original_space, tensorlib=tensorlib)
(pid=113923)   File "/home/xl3942/anaconda3/envs/CommAgent/lib/python3.8/site-packages/ray/rllib/models/modelv2.py", line 423, in _unpack_obs
(pid=113923)     raise ValueError(
(pid=113923) ValueError: Expected flattened obs shape of [..., 31], got torch.Size([32, 27])

@mannyv Hi, sorry about the spamming, I’ve found my misunderstanding about the code. Normally tuple spaces are handled by flattening, but I disabled my preprocessor because of this previous issue I had. Therefore the padding couldn’t be done correctly. The error in the reproduction script is likely also due to that previous issue, which hasn’t been fixed yet.

So I think the solution is, instead of disabling the preprocessor, to write a custom preprocessor that only flattens tuples into vectors. Is this correct?
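As a rough illustration of what such flattening could do for the space in the reproduction script, Tuple((Box(0, 1, (5,5)), Discrete(2), Discrete(4))), here is a NumPy-only sketch. The helper names one_hot and flatten_tuple_obs are hypothetical, and one-hot encoding the Discrete components is an assumption about the flattening scheme, not a statement of what RLlib's preprocessors do.

```python
import numpy as np

def one_hot(index, n):
    """Encode a discrete value in [0, n) as a length-n float vector."""
    vec = np.zeros(n, dtype=np.float32)
    vec[index] = 1.0
    return vec

def flatten_tuple_obs(obs, discrete_sizes=(2, 4)):
    """Flatten (box, discrete, discrete) into a single 1-D float vector."""
    box, *discretes = obs
    parts = [np.asarray(box, dtype=np.float32).ravel()]
    parts += [one_hot(d, n) for d, n in zip(discretes, discrete_sizes)]
    return np.concatenate(parts)

obs = (np.random.rand(5, 5), 1, 3)
flat = flatten_tuple_obs(obs)
print(flat.shape)  # (31,)
```

Under this assumption the flattened size is 25 + 2 + 4 = 31, which matches the expected size in the traceback, while 25 + 1 + 1 = 27 is what the components would add up to without one-hot encoding. That is only a guess at where the mismatch comes from.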


I can also reproduce your issue with the example observation space you provided, for both tf and torch.
It comes from the Box space being 2D. If I flatten it from Box(…,(5,5)) to Box(…,(25,)) then the error goes away. I am not sure why yet, though.