LSTM PPO mask dimension mismatch with a custom environment

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I have a custom environment and am able to train on it with a standard feedforward network using RLlib 2.6.1 and the PPO algorithm. I wanted to see whether an LSTM improves training, so I set `use_lstm=True` and `vf_share_layers=False` in the model config. However, I am now getting the following error message:

Exception has occurred: IndexError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
The shape of the mask [128, 11] at index 1 does not match the shape of the indexed tensor [128, 20] at index 1
  File "/Users/paula/Desktop/Projects/venvs/L2RPN_080_RLLIB_261_Grid2OP_195/lib/python3.11/site-packages/ray/rllib/algorithms/ppo/torch/ppo_torch_learner.py", line 59, in possibly_masked_mean
    return torch.sum(t[mask]) / num_valid
                     ~^^^^^^
  File "/Users/paula/Desktop/Projects/venvs/L2RPN_080_RLLIB_261_Grid2OP_195/lib/python3.11/site-packages/ray/rllib/algorithms/ppo/torch/ppo_torch_learner.py", line 88, in compute_loss_for_module
    mean_kl_loss = possibly_masked_mean(action_kl)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/paula/Desktop/Projects/venvs/L2RPN_080_RLLIB_261_Grid2OP_195/lib/python3.11/site-packages/ray/rllib/core/learner/learner.py", line 995, in compute_loss
    loss = self.compute_loss_for_module(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/paula/Desktop/Projects/venvs/L2RPN_080_RLLIB_261_Grid2OP_195/lib/python3.11/site-packages/ray/rllib/core/learner/torch/torch_learner.py", line 123, in _uncompiled_update
    loss_per_module = self.compute_loss(fwd_out=fwd_out, batch=batch)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/paula/Desktop/Projects/venvs/L2RPN_080_RLLIB_261_Grid2OP_195/lib/python3.11/site-packages/ray/rllib/core/learner/torch/torch_learner.py", line 365, in _update
    return self._possibly_compiled_update(batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/paula/Desktop/Projects/venvs/L2RPN_080_RLLIB_261_Grid2OP_195/lib/python3.11/site-packages/ray/rllib/core/learner/learner.py", line 1220, in update
    ) = self._update(nested_tensor_minibatch)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/paula/Desktop/Projects/venvs/L2RPN_080_RLLIB_261_Grid2OP_195/lib/python3.11/site-packages/ray/rllib/core/learner/learner_group.py", line 184, in update
    self._learner.update(
  File "/Users/paula/Desktop/Projects/venvs/L2RPN_080_RLLIB_261_Grid2OP_195/lib/python3.11/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 448, in training_step
    train_results = self.learner_group.update(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/paula/Desktop/Projects/venvs/L2RPN_080_RLLIB_261_Grid2OP_195/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm.py", line 2837, in _run_one_training_iteration
    results = self.training_step()
              ^^^^^^^^^^^^^^^^^^^^
  File "/Users/paula/Desktop/Projects/venvs/L2RPN_080_RLLIB_261_Grid2OP_195/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm.py", line 853, in step
    results, train_iter_ctx = self._run_one_training_iteration()
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/paula/Desktop/Projects/venvs/L2RPN_080_RLLIB_261_Grid2OP_195/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 375, in train
    raise skipped from exception_cause(skipped)
  File "/Users/paula/Desktop/Projects/venvs/L2RPN_080_RLLIB_261_Grid2OP_195/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 375, in train
    raise skipped from exception_cause(skipped)
  File "/Users/paula/Desktop/Projects/RL Practice/RLLIB_Practice4/train_LSTM.py", line 239, in train
    result = agent.nn_model.train()

I am not sharing the full code because it is quite complicated, but I did some debugging and found that the error occurs on this line: https://github.com/ray-project/ray/blob/a2d38078d3a2f502c0e22c1132745e206181810c/rllib/algorithms/ppo/torch/ppo_torch_learner.py#L59. Here the tensor `t` passed in is actually the `action_kl` variable, and `mask` is a boolean matrix with a different number of True entries in each row. The cause of the error is clear: `t.shape` is [128, 20] while `mask.shape` is [128, 11], so the boolean indexing fails.
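To illustrate, here is a minimal standalone PyTorch snippet (not RLlib code, just a sketch mimicking the shapes I observed in the debugger) that reproduces the same IndexError:

```python
import torch

# Shapes observed in the debugger: t (action_kl) is padded to 20 time steps,
# while the mask only covers 11 time steps (the longest sequence in the batch).
t = torch.randn(128, 20)
mask = torch.zeros(128, 11, dtype=torch.bool)
mask[:, :5] = True  # arbitrary pattern; only the shape matters here

try:
    masked = t[mask]  # same operation as in torch.sum(t[mask]) / num_valid
except IndexError as e:
    print(e)
# -> The shape of the mask [128, 11] at index 1 does not match
#    the shape of the indexed tensor [128, 20] at index 1
```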

The mask is computed here: https://github.com/ray-project/ray/blob/fd9a02e9cef9cff0e58e99274622c651e1227f4c/rllib/algorithms/ppo/torch/ppo_torch_learner.py#L55, and `maxlen` comes out to 11 because `batch[SampleBatch.SEQ_LENS]` contains the following values:

tensor([ 4., 4., 6., 3., 3., 2., 1., 5., 3., 3., 4., 3., 1., 8., 2., 5., 3., 7., 1., 4., 6., 4., 9., 7., 7., 1., 6., 6., 1., 7., 2., 3., 4., 2., 7., 1., 1., 4., 1., 5., 10., 7., 5., 6., 2., 3., 8., 1., 1., 9., 3., 5., 1., 2., 3., 1., 3., 2., 5., 3., 2., 4., 1., 4., 5., 2., 2., 4., 2., 2., 2., 3., 3., 4., 3., 1., 2., 4., 1., 5., 5., 2., 2., 3., 4., 11., 1., 4., 3., 5., 1., 5., 3., 5., 3., 3., 3., 3., 2., 1., 3., 5., 4., 3., 1., 3., 4., 3., 5., 3., 4., 4., 4., 4., 3., 2., 3., 10., 6., 1., 11., 2., 2., 6., 4., 1., 6., 3.])

Since the highest value is 11, the mask has a shape of [128, 11].
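For context, here is a rough sketch of how that mask ends up with this shape (this mirrors what the sequence-mask utility does as far as I can tell; it is not the exact RLlib implementation):

```python
import torch

# Truncated example of batch[SampleBatch.SEQ_LENS]; the full batch has 128 entries.
seq_lens = torch.tensor([4., 4., 6., 3., 11., 2.])
maxlen = int(seq_lens.max())  # 11 for my batch

# Row i has seq_lens[i] True entries followed by False padding.
mask = torch.arange(maxlen)[None, :] < seq_lens[:, None]
print(mask.shape)  # torch.Size([6, 11]) -- [128, 11] for the full batch
```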

My question is: do I need to set some config values differently, or is this an issue with the RLlib code? Any feedback and suggestions would be appreciated!
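For reference, this is roughly how I enable the LSTM (a simplified sketch, not my actual training script; the environment name and most values are placeholders):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="MyCustomEnv")  # placeholder for my custom environment
    .framework("torch")
    .training(
        model={
            "use_lstm": True,          # the change that triggers the error
            "vf_share_layers": False,
            # "max_seq_len": 20,       # RLlib's default, I believe; matches the 20 in t.shape
        },
    )
)

algo = config.build()
result = algo.train()
```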

I used PPO with an LSTM for my complicated DIY environment and I can't get a reward; it is always NaN.