Custom RLModule with LSTM fails to concatenate episodes of varying lengths (internal Ray error)

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.42.1
  • Python version: 3.12.9
  • OS: Windows 11
  • Cloud/Infrastructure: Local Machine Only
  • Other libs/tools (if relevant): None

3. What happened vs. what you expected:

  • Using: trainer = config.build_algo(), result = trainer.train()
  • Expected: The algorithm gathers data by running episodes until all samples have been collected, then begins training/backprop, and finally returns a finished result.
  • Actual: The algorithm gathers data and all episodes finish with the data fully collected, but before training begins the program crashes due to the following error:
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\numpy\_core\shape_base.py", line 449, in stack
    raise ValueError('all input arrays must have the same shape')

Steps for Problem Diagnosis

  • In the shape_base.py file, I made the following code change:
    shapes = {arr.shape for arr in arrays}
    print("shapes: ", shapes)  # ADDED HERE, LINE 447
    if len(shapes) != 1:
        raise ValueError('all input arrays must have the same shape')
  • The print immediately before this exception outputs the following:
     shapes:  {(58, 1, 7), (35, 1, 7), (41, 1, 7), (38, 1, 7), (57, 1, 7)}
  • Through testing and observation I figured out that the number of arrays here corresponds to the number of episodes that were run during exploration, and the first dimension of each array corresponds to the number of steps executed in that episode (the other two axes are part of the observation space for my data, which is basically just a 2D holder for my features).
  • So the problem here seems fairly obvious: the episodes are not being padded or truncated properly inside Ray before it tries to combine them after exploration has completed (note that setting config.env_runners(batch_mode='truncate_episodes') does not change this error at all).
  • I understand this could be a problem with how my data is structured, but based on the shape prints I have reason to believe I could also be missing some configuration parameter for my custom RLModule (I have tried changing train_batch_size, minibatch_size, rollout_fragment_length, max_seq_len, and use_lstm, and nothing changes), be missing something in my custom RLModule implementation, or be hitting some kind of internal failure. A sketch of roughly how things are configured follows this list.
  • I was following this implementation as a guide:
https://github.com/ray-project/ray/blob/master/rllib/examples/rl_modules/classes/lstm_containing_rlm.py#L99
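
For reference, this is roughly how things are wired up (a hedged sketch, not my exact code; the env id, the module class name, and all numeric values such as max_seq_len and lstm_cell_size are placeholders, and MyGraphLSTMModule refers to the module skeleton further below):

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.rl_module import RLModuleSpec

config = (
    PPOConfig()
    .environment("my_graph_env")                      # placeholder env id
    .env_runners(batch_mode="truncate_episodes")      # changing this did not help
    .training(
        train_batch_size=512,                         # tried several values here
        minibatch_size=64,
    )
    .rl_module(
        rl_module_spec=RLModuleSpec(
            module_class=MyGraphLSTMModule,           # custom TorchRLModule, sketched below
            model_config={"max_seq_len": 20, "lstm_cell_size": 256},
        )
    )
)

trainer = config.build_algo()
result = trainer.train()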

Additional Notes on the CustomRLModule

  • Originally I did not use an LSTM, and training worked fine across multiple epochs and other configuration params; this error only started occurring when I began returning the internal state and setting config params like "max_seq_len".
  • My module is designed to take heterogeneous graphs as input (only the node features change; the edge_index stays the same), so the inputs received by my model have the shape [B, T, N, F], where B = batch, T = timestep, N = node, F = features (I am not shipping T out of the env; T started being added by Ray when I began using the LSTM).
  • My custom RLModule inherits from: TorchRLModule, ValueFunctionAPI
  • Implemented methods are: setup(), get_initial_state(), _forward(), compute_values(). A stripped-down skeleton of the module's structure follows this list.
  • If more specific details on the code are needed, please ask and I can provide them; there is just a lot to explain since I am using custom envs and custom wrappers and I don't know what is relevant.
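
Roughly what the module skeleton looks like (a stripped-down sketch, not my real code: the layer sizes, the graph flattening, and names like MyGraphLSTMModule, _pi, and _vf are placeholders, and the structure follows the linked example rather than my exact implementation):

import numpy as np
import torch
from ray.rllib.core.columns import Columns
from ray.rllib.core.rl_module.apis import ValueFunctionAPI
from ray.rllib.core.rl_module.torch import TorchRLModule


class MyGraphLSTMModule(TorchRLModule, ValueFunctionAPI):
    def setup(self):
        n_nodes, n_feats = 1, 7                            # placeholder N and F dims
        self.cell_size = self.model_config.get("lstm_cell_size", 256)
        self._lstm = torch.nn.LSTM(n_nodes * n_feats, self.cell_size, batch_first=True)
        self._pi = torch.nn.Linear(self.cell_size, 3)      # placeholder action head
        self._vf = torch.nn.Linear(self.cell_size, 1)      # value head

    def get_initial_state(self):
        # "h" and "c" are the state keys Ray later tries to batch.
        return {
            "h": np.zeros((self.cell_size,), dtype=np.float32),
            "c": np.zeros((self.cell_size,), dtype=np.float32),
        }

    def _forward(self, batch, **kwargs):
        obs = batch[Columns.OBS]                           # [B, T, N, F] once the LSTM is on
        b, t = obs.shape[0], obs.shape[1]
        x = obs.reshape(b, t, -1)                          # flatten the graph dims for this sketch
        h = batch[Columns.STATE_IN]["h"].unsqueeze(0).contiguous()
        c = batch[Columns.STATE_IN]["c"].unsqueeze(0).contiguous()
        feats, (h, c) = self._lstm(x, (h, c))
        return {
            Columns.ACTION_DIST_INPUTS: self._pi(feats),
            Columns.STATE_OUT: {"h": h.squeeze(0), "c": c.squeeze(0)},
        }

    def compute_values(self, batch, embeddings=None):
        # Same trunk with the value head on top; returns a [B, T] tensor.
        obs = batch[Columns.OBS]
        b, t = obs.shape[0], obs.shape[1]
        x = obs.reshape(b, t, -1)
        h = batch[Columns.STATE_IN]["h"].unsqueeze(0).contiguous()
        c = batch[Columns.STATE_IN]["c"].unsqueeze(0).contiguous()
        feats, _ = self._lstm(x, (h, c))
        return self._vf(feats).squeeze(-1)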

Full Error Stack Trace:

Traceback (most recent call last):
  File "z:\Thesis\Reinforcement Learning\Trainer.py", line 182, in <module>
    result = trainer.train()
             ^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\tune\trainable\trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\tune\trainable\trainable.py", line 328, in train
    result = self.step()
             ^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\algorithms\algorithm.py", line 1022, in step
    train_results, train_iter_ctx = self._run_one_training_iteration()
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\algorithms\algorithm.py", line 3382, in _run_one_training_iteration
    training_step_return_value = self.training_step()
                                 ^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\algorithms\ppo\ppo.py", line 429, in training_step
    learner_results = self.learner_group.update_from_episodes(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\core\learner\learner_group.py", line 327, in update_from_episodes
    return self._update(
           ^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\core\learner\learner_group.py", line 422, in _update
    _learner_update(
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\core\learner\learner_group.py", line 385, in _learner_update
    result = _learner.update_from_episodes(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\core\learner\learner.py", line 1086, in update_from_episodes
    self._update_from_batch_or_episodes(
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\core\learner\learner.py", line 1362, in _update_from_batch_or_episodes
    batch = self._learner_connector(
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\connectors\learner\learner_connector_pipeline.py", line 38, in __call__
    ret = super().__call__(
          ^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\connectors\connector_pipeline_v2.py", line 111, in __call__
    batch = connector(
            ^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\connectors\common\batch_individual_items.py", line 182, in __call__
    else batch_fn(
         ^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\utils\spaces\space_utils.py", line 378, in batch
    ret = tree.map_structure(lambda *s: np_func(s, axis=0), *list_of_structs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\tree\__init__.py", line 429, in map_structure
    [func(*args) for args in zip(*map(flatten, structures))])
     ^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\utils\spaces\space_utils.py", line 378, in <lambda>
    ret = tree.map_structure(lambda *s: np_func(s, axis=0), *list_of_structs)
                                        ^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\numpy\_core\shape_base.py", line 449, in stack
    raise ValueError('all input arrays must have the same shape')

OK, so a couple of updates:

The print added in shape_base.py

  • I figured out some more details here: the number of arrays is indeed the number of episodes, but the first dimension's size is not exactly the number of steps in that episode; it turns out to actually be num_steps (in this episode only) + max_seq_len (the config param); a tiny worked example of this follows below.
  • If a developer could please confirm whether this is the intended behavior, that would help with a new problem that has appeared (see below).
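
Purely to illustrate that pattern (the max_seq_len value here is made up; this only shows the arithmetic that would line up with the shapes print above):

# If max_seq_len happened to be 20, the shapes print above would correspond to
# episodes of 38, 15, 21, 18, and 37 steps.
max_seq_len = 20
episode_steps = [38, 15, 21, 18, 37]
print([steps + max_seq_len for steps in episode_steps])   # -> [58, 35, 41, 38, 57]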

Temporary Workaround

  • To work around the error, the environment was set to always run the same number of steps per episode, and train_batch_size in config.training() then has to be set to a factor of that fixed step count; a small sketch follows below.
  • This is not an acceptable long-term solution, and I am not sure how to fix it properly.
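
What that workaround looks like in practice, continuing the config sketch from earlier (the numbers are only examples):

EPISODE_LEN = 60                        # env forced to always run exactly this many steps

config = config.training(
    train_batch_size=60,                # a factor of EPISODE_LEN, per the observation above
    minibatch_size=30,
)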

Other issue (solved maybe?)

  • For some reason, an error occurs in batch_individual_items.py: it tries to merge the state_ins, but the first element in each of them always ends up being empty. Here is the code I used to fix it. I think this is a problem in Ray, but more testing is needed to find out whether Ray is trying to merge something it shouldn't or whether I am missing some return value or preset variable somewhere. (Obviously this patch only applies to my use of 'h' and 'c' as the state keys; a solution that looks for any keys would be more practical for all users in case these issues are not isolated; a key-agnostic sketch follows the patch.)
# ADDED AT LINE 181 of batch_individual_items.py
if column == Columns.STATE_IN and len(list_to_be_batched[0]['h']) > 1:
    # Debug output: how many items are being batched and what their 'h' lists look like.
    print("trying to batch:", column, list_to_be_batched[0]['h'][1].shape,
          "num items:", len(list_to_be_batched),
          "len of first 'h':", len(list_to_be_batched[0]['h']))
    print("first 'h' entry:", list_to_be_batched[0]['h'][0])
    for item in list_to_be_batched:
        # Only touch items that actually have an 'h' state with at least one element.
        if 'h' in item and len(item['h']) > 0:
            first_element = item['h'][0]
            print(f"type: {type(first_element)}, "
                  f"is BatchedNdArray? {isinstance(first_element, BatchedNdArray)}, "
                  f"size: {first_element.size}, value: {first_element}, "
                  f"== None: {first_element == None}")
            # If the leading entry is an empty/None placeholder, drop it from both states.
            if isinstance(first_element, BatchedNdArray) and first_element == None:
                print("none found!")
                item['h'] = item['h'][1:]
                item['c'] = item['c'][1:]
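
And a hedged, untested sketch of what a key-agnostic version of the same idea might look like (it just applies the same "leading placeholder entry" check to whatever state keys are present, instead of hard-coding 'h' and 'c'):

if column == Columns.STATE_IN:
    for item in list_to_be_batched:
        state_keys = list(item.keys())
        if not state_keys or len(item[state_keys[0]]) == 0:
            continue
        first_element = item[state_keys[0]][0]
        # Same placeholder check as in the patch above, just not tied to 'h'/'c'.
        if isinstance(first_element, BatchedNdArray) and first_element == None:
            for key in state_keys:
                item[key] = item[key][1:]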

A New Problem

  • Now, at this point, forward_exploration completes, the _compute_value() function gets called and completes, and my _forward_train() begins.

  • Some notes on the sizes: the first dim is minibatch_size / max_seq_len; the second differs between the value output and the batch/action tensors: for the value it is max_seq_len, but for the batch/actions it is what I described earlier (max_seq_len + time_steps of this episode only); the remaining dims are related to my data (this is actually the action dim). A minimal illustration of the resulting shape mismatch follows the stack trace below.

  • full stack trace

Traceback (most recent call last):
  File "z:\Thesis\Reinforcement Learning\Trainer.py", line 184, in <module>
    result = trainer.train()
             ^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\tune\trainable\trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\tune\trainable\trainable.py", line 328, in train
    result = self.step()
             ^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\algorithms\algorithm.py", line 1022, in step
    train_results, train_iter_ctx = self._run_one_training_iteration()
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\algorithms\algorithm.py", line 3382, in _run_one_training_iteration
    training_step_return_value = self.training_step()
                                 ^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\algorithms\ppo\ppo.py", line 429, in training_step
    learner_results = self.learner_group.update_from_episodes(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\core\learner\learner_group.py", line 327, in update_from_episodes
    return self._update(
           ^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\core\learner\learner_group.py", line 422, in _update
    _learner_update(
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\core\learner\learner_group.py", line 385, in _learner_update
    result = _learner.update_from_episodes(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\core\learner\learner.py", line 1086, in update_from_episodes
    self._update_from_batch_or_episodes(
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\core\learner\learner.py", line 1434, in _update_from_batch_or_episodes
    fwd_out, loss_per_module, tensor_metrics = self._update(
                                               ^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\core\learner\torch\torch_learner.py", line 497, in _update
    return self._possibly_compiled_update(batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\core\learner\torch\torch_learner.py", line 152, in _uncompiled_update
    loss_per_module = self.compute_losses(fwd_out=fwd_out, batch=batch)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\core\learner\learner.py", line 924, in compute_losses
    loss = self.compute_loss_for_module(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\algorithms\ppo\torch\ppo_torch_learner.py", line 75, in compute_loss_for_module
    curr_action_dist.logp(batch[Columns.ACTIONS]) - batch[Columns.ACTION_LOGP]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\models\torch\torch_distributions.py", line 215, in logp
    return super().logp(value).sum(-1)
           ^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\ray\rllib\models\torch\torch_distributions.py", line 38, in logp
    return self._dist.log_prob(value, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\torch\distributions\normal.py", line 82, in log_prob
    self._validate_sample(value)
  File "Z:\Thesis\Reinforcement Learning\venv\Lib\site-packages\torch\distributions\distribution.py", line 302, in _validate_sample
    raise ValueError(
ValueError: Value is not broadcastable with batch_shape+event_shape: torch.Size([1, 15, 23, 3]) vs torch.Size([1, 30, 23, 3]).
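
For reference, that last error can be reproduced outside of Ray; this is only a minimal illustration of what torch is complaining about, with the shapes copied from the error above (it says nothing about which side is "wrong" in my setup):

import torch
from torch.distributions import Normal

# One tensor ends up with 30 entries in the time dimension ...
dist = Normal(torch.zeros(1, 30, 23, 3), torch.ones(1, 30, 23, 3), validate_args=True)
# ... while the other only has 15, so log_prob() refuses to broadcast them.
actions = torch.zeros(1, 15, 23, 3)

try:
    dist.log_prob(actions)
except ValueError as err:
    print(err)   # Value is not broadcastable with batch_shape+event_shape: ...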

If you have time, could you please help with the train batch sizes being off? That would help a lot, thank you.
@sven1977 @christina

Hey @BigBootyFarma, could you provide a small reproducible example? That would help us come up with a general solution.