Hi again!
Here is a brief description of the relevant parts of my project.
I am running a multi-agent reinforcement learning environment using the Ray/RLlib library (version 0.8.6) for an extension of the AI-Economist framework. Here is the relevant code:
import numpy as np

import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.models.catalog import ModelCatalog

from rllib.env_wrapper import RLlibEnvWrapper
from rllib.tf_models import KerasConvLSTM

# Build the environment object and register the custom model.
env_obj = RLlibEnvWrapper({"env_config_dict": env_config_dict}, verbose=True)
ModelCatalog.register_custom_model(KerasConvLSTM.custom_name, KerasConvLSTM)
policies = {
    "a": (
        None,  # uses the default policy class
        env_obj.observation_space,
        env_obj.action_space,
        # Custom agent policy configuration:
        {
            "clip_param": 0.3,
            "entropy_coeff": 0.025,
            "entropy_coeff_schedule": None,
            "gamma": 0.998,
            "grad_clip": 10.0,
            "kl_coeff": 0.0,
            "kl_target": 0.01,
            "lambda": 0.98,
            "lr": 0.0003,
            "lr_schedule": None,
            "model": {
                "custom_model": "keras_conv_lstm",
                "custom_model_config": {
                    "fc_dim": 128,
                    "idx_emb_dim": 4,
                    "input_emb_vocab": 100,
                    "lstm_cell_size": 128,
                    "num_conv": 2,
                    "num_fc": 2,
                },
                "max_seq_len": 25,
            },
            "use_gae": True,
            "vf_clip_param": 50.0,
            "vf_loss_coeff": 0.05,
            "vf_share_layers": False,
        },
    ),
    "p": (
        None,  # uses the default policy class
        env_obj.observation_space_pl,
        env_obj.action_space_pl,
        # Custom planner policy configuration:
        {
            "clip_param": 0.3,
            "entropy_coeff": 0.125,
            "entropy_coeff_schedule": [[0, 2.0], [50000000, 0.125]],
            "gamma": 0.998,
            "grad_clip": 10.0,
            "kl_coeff": 0.0,
            "kl_target": 0.01,
            "lambda": 0.98,
            "lr": 0.0001,
            "lr_schedule": None,
            "model": {
                "custom_model": "keras_conv_lstm",
                "custom_model_config": {
                    "fc_dim": 256,
                    "idx_emb_dim": 4,
                    "input_emb_vocab": 100,
                    "lstm_cell_size": 256,
                    "num_conv": 2,
                    "num_fc": 2,
                },
                "max_seq_len": 25,
            },
            "use_gae": True,
            "vf_clip_param": 50.0,
            "vf_loss_coeff": 0.05,
            "vf_share_layers": False,
        },
    ),
}
# Numeric agent ids map to the agent policy "a"; the planner id maps to the planner policy "p".
policy_mapping_fun = lambda i: "a" if str(i).isdigit() else "p"
policies_to_train = ["a", "p"]
trainer_config = {
    "multiagent": {
        "policies": policies,
        "policies_to_train": policies_to_train,
        "policy_mapping_fn": policy_mapping_fun,
    }
}
trainer_config.update(
    {
        "num_workers": 6,
        "num_envs_per_worker": 1,
        # Other training parameters
        "train_batch_size": 4000,
        "sgd_minibatch_size": 4000,
        "num_sgd_iter": 1,
        # Write the collected sample batches to JSON files in this directory.
        "output": "D:\\ENI Projects\\Aslan\\AutocurriculaLab\\Githubs\\modified-ai-economist-main\\Results",
    }
)
# We also pass "num_envs_per_worker" to the environment wrapper so it can index its environments.
env_config = {
    "env_config_dict": env_config_dict,
    "num_envs_per_worker": trainer_config.get("num_envs_per_worker"),
}
trainer_config.update({"env_config": env_config})
ray.init()
trainer = PPOTrainer(env=RLlibEnvWrapper, config=trainer_config)

NUM_ITERS = 20
episode_reward_mean = np.zeros(NUM_ITERS)
for iteration in range(NUM_ITERS):
    print(f"********** Iter : {iteration} **********")
    result = trainer.train()
    episode_reward_mean[iteration] = result.get("episode_reward_mean")
    print(f"episode_reward_mean: {episode_reward_mean[iteration]}")
After running this code with 5 + 1 agents and an episode length of 2000, 240 output files are generated with names of the following form:
output-date_time_worker_i_j
The index "i" runs from 1 to 6, and the index "j" runs from 0 to 39.
I assume the 6 corresponds to the number of workers, and the 40 comes from the number of iterations, 20, multiplied by 2.
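As a quick sanity check of that assumption, I group the output files by worker index. This is only a rough sketch: the regex is just my guess at the file-name pattern, and the directory is the "output" path from my config.

import glob
import os
import re
from collections import Counter

output_dir = "D:\\ENI Projects\\Aslan\\AutocurriculaLab\\Githubs\\modified-ai-economist-main\\Results"

# Count how many files each rollout worker has written.
worker_counts = Counter()
for path in glob.glob(os.path.join(output_dir, "output-*")):
    match = re.search(r"worker[-_](\d+)[-_](\d+)", os.path.basename(path))
    if match:
        worker_counts[int(match.group(1))] += 1

print(worker_counts)  # I expect 6 workers with 40 files each.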
In each of these files there are 8 sets of state_out_a values, where "a" runs from 0 to 3. I think these 4 state values correspond to the idx_emb_dim defined in my LSTM. Moreover, each of these values has shape (1000, 128) or (1000, 256); again, I think the 128 and 256 correspond to the lstm_cell_size defined in my LSTM. However, I have no clue where the 1000 comes from. Basically, my question (maybe a naïve one!) is whether it is possible for an agent to define one state, or a set of states, at each time step of an episode.
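For reference, this is roughly how I inspect those shapes, reading one of the files back with RLlib's JsonReader. It is only a sketch: the directory is the "output" path from my config, and I am assuming the files deserialize into MultiAgentBatch objects with one SampleBatch per policy.

import glob
import os

from ray.rllib.offline.json_reader import JsonReader
from ray.rllib.policy.sample_batch import MultiAgentBatch

output_dir = "D:\\ENI Projects\\Aslan\\AutocurriculaLab\\Githubs\\modified-ai-economist-main\\Results"
sample_file = sorted(glob.glob(os.path.join(output_dir, "output-*")))[0]

reader = JsonReader(sample_file)
batch = reader.next()

# Print the shape of every recurrent-state output column in the batch.
if isinstance(batch, MultiAgentBatch):
    for policy_id, policy_batch in batch.policy_batches.items():
        for key in policy_batch.keys():
            if key.startswith("state_out"):
                print(policy_id, key, policy_batch[key].shape)
else:
    for key in batch.keys():
        if key.startswith("state_out"):
            print(key, batch[key].shape)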
Many many thanks in advance and sorry for so many questions!