Save RNN model's cell and hidden state

Hi all,

if a policy model contains an RNN, or more precisely an LSTM cell, then trainer.save() stores the weights of all trainable variables. However, trainer.save() doesn’t store the most recent cell and hidden state of the LSTM cell.
Is there a way to store these tensors (at least the cell state), too? Or does that not make sense? I’m not sure about it. To my understanding, the cell state is a long-term memory, so keeping it might be helpful.

Hi @klausk55 ,

this is correct: the cell state and hidden state do not get stored by trainer.save(). One way to get your states stored is to use RLlib’s Offline API. Just set the output parameter in your trainer config to a file path, and RLlib stores all sample batches there; in these batches you will find state_out_0 and state_out_1, which should be your hidden and cell state.
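
For illustration, here is a minimal sketch of such a config (the environment name, output path, and LSTM settings are just placeholders for this example):

from ray.rllib.agents.ppo import PPOTrainer

config = {
    "env": "CartPole-v0",                                # placeholder environment
    "model": {"use_lstm": True, "lstm_cell_size": 64},   # placeholder LSTM settings
    "output": "/tmp/rllib_samples",                      # Offline API writes the sample batches here
}
trainer = PPOTrainer(config=config)
trainer.train()
# The JSON files under /tmp/rllib_samples now contain columns such as
# state_in_0/state_in_1 and state_out_0/state_out_1 (hidden and cell state).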

Hope this helps

1 Like

Hello @Lars_Simon_Zehnder,

that sounds good to me! I’ve already done this but hadn’t noticed it :sweat_smile:

What do you think: is it reasonable to load some of the internal RNN states when deploying the previously trained policy model? Or what do you use as an initial state when you deploy a trained model?

Hi @klausk55 ,

this is a great question! As far as I understand it, the hidden state and cell state get reset during evaluation. This makes sense in my opinion, as the hidden state and cell state should at most keep information from the running episode in memory.

A new episode describes a new path through the MDP, and therefore the memory makes sense within an episode but not across several episodes. This is an interesting discussion, though. Maybe @mannyv , @arturn and @sven1977 have different opinions or something to add here? I am excited to hear.
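
To make the per-episode reset concrete, here is a rough sketch of an evaluation loop (trainer and env are assumed to already exist, and the policy is assumed to use an LSTM): the RNN state is threaded through compute_action and reset to the initial zero state at every episode boundary.

policy = trainer.get_policy()

for episode in range(5):
    obs = env.reset()
    state = policy.get_initial_state()   # fresh (zero) hidden/cell state for every episode
    done = False
    while not done:
        # with an RNN policy, compute_action returns the next RNN state along with the action
        action, state, _ = trainer.compute_action(obs, state=state)
        obs, reward, done, info = env.step(action)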

1 Like

Hi @klausk55, Hi @Lars_Simon_Zehnder ,

Without having done any experiments: I would argue intuitively that bringing the hidden state into the next episode would rather hinder learning than promote it, because the state would encode information that is simply wrong. So if your cell has learned to encode some useful information, and you carry the hidden state over to the next episode, the gradients produced by training on it will likely work against what your network has previously learned about what is encoded in that hidden state. Instead, the first hidden state should encode solely that there is no information from prior steps, for example with a bunch of 0s.

Again, this is only my intuition :slight_smile:

Cheers!

2 Likes

Hi @arturn,

your argumentation sounds good to my ears.
However, what about the cell state? Would it behave similarly to the hidden state? To my intuition, the cell state encodes some information about previous steps (parts of the recent history). Therefore, one could again argue that this “old information” in a “new situation” is rather counterproductive than helpful for a learned model.

I would appreciate hearing some further thoughts on this!

1 Like

Hi @arturn ,

interesting answer! Thanks for replying here! My intuition was quite similar. I thought that walking a different path through the MDP usually involves very different information due to the randomness. Keeping the state would result in stale information - not useful on the next path through the MDP. Your interpretation in terms of the gradients and of having no prior information is really nicely laid out.

@klausk55 I hope that helps you to make your decision.

1 Like

Hi @klausk55 ,

great question again! In my opinion the same argument holds here. The cell state usually keeps information a little longer, but it is also fed with information from the current episode and is therefore specifically relevant for that episode. My intuition tells me that this information in the cell state will still be stale in another episode, especially if episodes are very long. It would be an interesting experiment, though. I would argue that the longest memory is actually contained in the weights, and these carry the accumulated gradient information - which is often quite good. So re-initialising the cell state and hidden state while keeping the weights fixed should give the best results.

1 Like

Hi @Lars_Simon_Zehnder,

my episodes are really, really long! Currently, I reset my env after 24 hours have passed in my simulation.
Thanks guys for the interesting and helpful discussion! It seems that the best choice is

def get_initial_state(self):
    # Override of the RLlib model's get_initial_state():
    # return zero-initialized hidden and cell states of size self.cell_size.
    return [np.zeros(self.cell_size, np.float32),
            np.zeros(self.cell_size, np.float32)]
2 Likes

@klausk55 ,

great discussion indeed! Thanks for bringing this in!

2 Likes

Hi all,
I was wondering if someone could tell me how to set the output parameter in my trainer config so as to save all hidden states of the LSTM network.

I tried it this way:

trainer_config.update(
    {
        "num_workers": ...,
        "num_envs_per_worker": ...,
        "train_batch_size": ...,
        "sgd_minibatch_size": ...,
        "num_sgd_iter": ...,
        "output": ...,
    }
)

but it doesn’t work!
Many thanks in advance!

Did you try what @Lars_Simon_Zehnder had written above?
Set the output parameter to a string containing a file path of your choice.
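
For example (the path below is only a placeholder, use any writable directory):

trainer_config.update(
    {
        "output": "/tmp/lstm_batches",  # RLlib's Offline API writes the sample batches here
    }
)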

Ok, I tried it before, but due to my own mistake it didn’t work! Now, after some tweaking, it works. Many thanks for your suggestion!

Now I have a long sequence of data like this in each one of those files:

{"type": "MultiAgentBatch", "count": 200, "policy_batches": {"a": {"t": [400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416 …

I guess these are not the hidden states of the LSTM, so I was wondering if you could tell me whether I need to do something else to save the hidden states in those files.

Again many thanks in advance!

I’m not sure, but somewhere in these written batches you should have variables called state_in_h|c and state_out_h|c or similar. These should be your hidden and cell states.
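
If you want to inspect them programmatically rather than in the raw JSON, a small sketch like the following should work (assuming RLlib’s JsonReader, the output directory from your config, and the policy id "a" from your setup):

from ray.rllib.offline import JsonReader

reader = JsonReader("/path/to/your/output_dir")   # the directory set via "output"
batch = reader.next()                             # a MultiAgentBatch in the multi-agent case
agent_batch = batch.policy_batches["a"]           # per-policy SampleBatch
print([k for k in agent_batch.keys() if k.startswith("state_")])
print(agent_batch["state_out_0"].shape)           # e.g. (rollout_length, lstm_cell_size)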

I found them! Many many thanks!

Hi again!

Here is a brief description of the relevant parts of my project:

I am running a multi-agent reinforcement learning environment using the Ray/RLlib library (version 0.8.6) for an extension of the AI-Economist framework. Here is the relevant code:

from rllib.env_wrapper import RLlibEnvWrapper
env_obj = RLlibEnvWrapper({"env_config_dict": env_config_dict}, verbose=True)
import numpy as np

import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.models.catalog import ModelCatalog
from rllib.tf_models import KerasConvLSTM

ModelCatalog.register_custom_model(KerasConvLSTM.custom_name, KerasConvLSTM)

policies = {
    "a": (
        None,  # uses default policy
        env_obj.observation_space,
        env_obj.action_space,
        #{},
        {'clip_param': 0.3,
         'entropy_coeff': 0.025,
         'entropy_coeff_schedule': None,
         'gamma': 0.998,
         'grad_clip': 10.0,
         'kl_coeff': 0.0,
         'kl_target': 0.01,
         'lambda': 0.98,
         'lr': 0.0003,
         'lr_schedule': None,
         'model': {'custom_model': 'keras_conv_lstm',                  
                   'custom_model_config': {'fc_dim': 128,
                                           'idx_emb_dim': 4,
                                           'input_emb_vocab': 100,
                                           'lstm_cell_size': 128,
                                           'num_conv': 2, 
                                           'num_fc': 2},
                   'max_seq_len': 25},
         'use_gae': True,
         'vf_clip_param': 50.0,
         'vf_loss_coeff': 0.05,
         'vf_share_layers': False}  # define a custom agent policy configuration.
    ),
    "p": (
        None,  # uses default policy
        env_obj.observation_space_pl,
        env_obj.action_space_pl,
        #{},
        {'clip_param': 0.3,
         'entropy_coeff': 0.125,
         'entropy_coeff_schedule': [[0, 2.0], [50000000, 0.125]],
         'gamma': 0.998,
         'grad_clip': 10.0,
         'kl_coeff': 0.0,
         'kl_target': 0.01,
         'lambda': 0.98,
         'lr': 0.0001,
         'lr_schedule': None,
         'model': {'custom_model': 'keras_conv_lstm',
                   'custom_model_config': {'fc_dim': 256,
                                           'idx_emb_dim': 4,
                                           'input_emb_vocab': 100,
                                           'lstm_cell_size': 256,
                                           'num_conv': 2, 
                                           'num_fc': 2},
                   'max_seq_len': 25},
         'use_gae': True,
         'vf_clip_param': 50.0,
         'vf_loss_coeff': 0.05,
         'vf_share_layers': False}  # define a custom planner policy configuration.
    )
}

policy_mapping_fun = lambda i: "a" if str(i).isdigit() else "p"

policies_to_train = ["a", "p"]
trainer_config = {
    "multiagent": {
        "policies": policies,
        "policies_to_train": policies_to_train,
        "policy_mapping_fn": policy_mapping_fun,
    }
}
trainer_config.update(
    {
        "num_workers": 6,
        "num_envs_per_worker": 1,
        # Other training parameters
        "train_batch_size": 4000,
        "sgd_minibatch_size": 4000,
        "num_sgd_iter": 1,
        "output": "D:\\ENI Projects\\Aslan\\AutocurriculaLab\\Githubs\\modified-ai-economist-main\\Results",
    }
)
# We also add the "num_envs_per_worker" parameter for the env. wrapper to index the environments.
env_config = {
    "env_config_dict": env_config_dict,
    "num_envs_per_worker": trainer_config.get('num_envs_per_worker'),   
}

trainer_config.update(
    {
        "env_config": env_config        
    }
)
ray.init()
trainer = PPOTrainer(env=RLlibEnvWrapper, config=trainer_config)
NUM_ITERS = 20

episode_reward_mean = np.zeros(NUM_ITERS)

for iteration in range(NUM_ITERS):
    print(f'********** Iter : {iteration} **********')
    
    result = trainer.train()
    episode_reward_mean[iteration] = result.get('episode_reward_mean')
    print(f'''episode_reward_mean: {episode_reward_mean[iteration]}''')

After running this code for 5 + 1 agents and an episode length of 2000, 240 output files are generated, with names of the following structure:
output-date_time_worker_i_j

The number “i” goes from 1 to 6, and the number “j” goes from 0 to 39.

I assume 6 refers to the number of workers, and 40 refers to the number of iterations, 20, multiplied by 2.

In each one of these files, there are 8 sets of state_out_a values, in which “a” goes from 0 to 3. I think these 4 state values refer to the idx_emb_dim defined in my LSTM. Moreover, each of these values has the size (1000, 128) or (1000, 256). Again, I think 128 or 256 refer to the lstm_cell_size defined in my LSTM. However, I don’t have any clue about the origin of 1000. Basically, my question (maybe a naïve one!) is whether it is possible to obtain, for an agent, one state or a set of states at each time point of an episode.

Many many thanks in advance and sorry for so many questions!

One quick note: the number 1000 could be the number of time points across half of an episode, since for each iteration there are two output files. This is just my guess and I am not sure! However, there is still one mysterious number, 8, which I don’t know how to relate to the number of agents in my environment, 5 + 1.