How to get Curiosity Policy Weights from a Policy Client

Hi all,

I just have a quick question: how can I get the curiosity module weights from the policy client(s) built into RLlib?

Thanks!


Hey @Denys_Ashikhin, great question. The curiosity model consists of three sub-modules: the feature net, the inverse-dynamics net, and the forward net. These are added onto your “normal” policy model (trainer.get_policy().model.[_curiosity_feature_net|_curiosity_forward_fcnet|_curiosity_inverse_fcnet]). All of these are native torch (nn.Module) or tf.keras models.

So you could do something like:

torch:
trainer.get_policy().model.[_curiosity_feature_net|_curiosity_forward_fcnet|_curiosity_inverse_fcnet].state_dict()

tf:
trainer.get_policy().model.[_curiosity_feature_net|_curiosity_forward_fcnet|_curiosity_inverse_fcnet].variables
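
For instance, on the torch side, a quick way to peek at one of these sub-modules could look like this (a minimal sketch; trainer is assumed to be an already-built Trainer configured with the Curiosity exploration):

# Grab the ICM feature net from the policy's model (torch framework).
feature_net = trainer.get_policy().model._curiosity_feature_net

# state_dict() maps parameter names to torch tensors.
for name, tensor in feature_net.state_dict().items():
    print(name, tuple(tensor.shape))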

Okay, there are a lot of values there :sweat_smile:. What would be the best way to check whether the model’s parameters are actually changing between checkpoints? Someone mentioned that my curiosity module wasn’t changing between iterations, so I’m not sure which values to compare when loading the different checkpoints to verify that it actually is learning!

@Denys_Ashikhin I just summed them last time I checked.

Just to make sure I’m on the same page here, you mean something like this?

# load in checkpoint-1
sum1 = 0
sum29 = 0

sum1 += trainer.get_policy().model._curiosity_feature_net.variables
sum1 += trainer.get_policy().model._curiosity_forward_fcnet.variables
sum1 += trainer.get_policy().model._curiosity_inverse_fcnet.variables

# load in checkpoint-29
sum29 += trainer.get_policy().model._curiosity_feature_net.variables
sum29 += trainer.get_policy().model._curiosity_forward_fcnet.variables
sum29 += trainer.get_policy().model._curiosity_inverse_fcnet.variables

print(sum1 == sum29)

?

@Denys_Ashikhin

The variables member is probably a dictionary, so you will need to enumerate its values. Those values are going to be numpy arrays, so you will need to sum them.

Something like:
sum(v.sum() for v in trainer.get_policy().model._curiosity_feature_net.variables().values())

Also, you might want to use np.isclose() to test for equality, since the sums could be effectively equal but not exactly the same for silly machine-precision reasons.
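
For example (a toy illustration, just to show the comparison):

import numpy as np

sum1 = 123.4567890123
sum29 = 123.4567890125  # differs only by floating-point noise

print(sum1 == sum29)            # False
print(np.isclose(sum1, sum29))  # True within the default tolerances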


I’m having an issue with the .variables.values() call, which doesn’t seem to be valid. I did inspect the .model in debug mode, but there’s so much going on that I’m not sure which values are the ones I want :sweat_smile:

I think variables is a method, so you need to call it; I updated the previous reply. It’s possible it’s a list, but I’m pretty sure it’s a dict.

Sorry for taking so long to reply; here is the sample code I am using:

import os
import ray
from ray.rllib.agents import with_common_config
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.env import PolicyServerInput

from ray.rllib.examples.env.random_env import RandomEnv
from gym import spaces

DEFAULT_CONFIG = with_common_config({
    # Should use a critic as a baseline (otherwise don't use value baseline;
    # required for using GAE).
    "use_critic": True,
    # If true, use the Generalized Advantage Estimator (GAE)
    # with a value function, see https://arxiv.org/pdf/1506.02438.pdf.
    "use_gae": True,
    # The GAE (lambda) parameter.
    "lambda": 1.0,
    # Initial coefficient for KL divergence.
    "kl_coeff": 0.2,
    # Size of batches collected from each worker.
    "rollout_fragment_length": 20,
    # Number of timesteps collected for each SGD round. This defines the size
    # of each SGD epoch.
    "train_batch_size": 5000,
    # Total SGD batch size across all devices for SGD. This defines the
    # minibatch size within each epoch.
    "sgd_minibatch_size": 200,
    # Number of SGD iterations in each outer loop (i.e., number of epochs to
    # execute per train batch).
    "num_sgd_iter": 25,
    # Whether to shuffle sequences in the batch when training (recommended).
    "shuffle_sequences": True,
    # Stepsize of SGD.
    "lr": 3e-5,
    # Learning rate schedule.
    "lr_schedule": None,
    # Coefficient of the value function loss. IMPORTANT: you must tune this if
    # you set vf_share_layers=True inside your model's config.
    "vf_loss_coeff": 1.0,
    "model": {
        # Share layers for value function. If you set this to True, it's
        # important to tune vf_loss_coeff.
        "vf_share_layers": False,
        "fcnet_hiddens": [56, 56],
        "use_lstm": False
        # "max_seq_len": 3,
    },
    # Coefficient of the entropy regularizer.
    "entropy_coeff": 0.0,
    # Decay schedule for the entropy regularizer.
    "entropy_coeff_schedule": None,
    # PPO clip parameter.
    "clip_param": 0.3,
    # Clip param for the value function. Note that this is sensitive to the
    # scale of the rewards. If your expected V is large, increase this.
    "vf_clip_param": 50000.0,
    # If specified, clip the global norm of gradients by this amount.
    "grad_clip": None,
    # Target value for KL divergence.
    "kl_target": 0.01,
    # Whether to rollout "complete_episodes" or "truncate_episodes".
    "batch_mode": "complete_episodes",
    # Which observation filter to apply to the observation.
    "observation_filter": "NoFilter",
    # Uses the sync samples optimizer instead of the multi-gpu one. This is
    # usually slower, but you might want to try it if you run into issues with
    # the default optimizer.
    # "simple_optimizer": False,
    # Whether to fake GPUs (using CPUs).
    # Set this to True for debugging on non-GPU machines (set `num_gpus` > 0).
    # "_fake_gpus": True,
    "num_gpus": 1,
    # Use the connector server to generate experiences.
    "input": (
        lambda ioctx: PolicyServerInput(ioctx, '127.0.0.1', 55558)
    ),
    # Use a single worker process to run the server.
    "num_workers": 0,
    # Disable OPE, since the rollouts are coming from online clients.
    "input_evaluation": [],
    # "callbacks": MyCallbacks,
    "env_config": {"sleep": True,},
    "framework": "tf",
    # "eager_tracing": True,
    "explore": True,
    "exploration_config": {
        "type": "Curiosity",  # <- Use the Curiosity module for exploring.
        "eta": 1.0,  # Weight for intrinsic rewards before being added to extrinsic ones.
        "lr": 0.001,  # Learning rate of the curiosity (ICM) module.
        "feature_dim": 512,  # Dimensionality of the generated feature vectors.
        # Setup of the feature net (used to encode observations into feature (latent) vectors).
        "inverse_net_hiddens": [64],  # Hidden layers of the "inverse" model.
        "inverse_net_activation": "relu",  # Activation of the "inverse" model.
        "forward_net_hiddens": [64],  # Hidden layers of the "forward" model.
        "forward_net_activation": "relu",  # Activation of the "forward" model.
        "beta": 0.2,  # Weight for the "forward" loss (beta) over the "inverse" loss (1.0 - beta).
        # Specify, which exploration sub-type to use (usually, the algo's "default"
        # exploration, e.g. EpsilonGreedy for DQN, StochasticSampling for PG/SAC).
        "sub_exploration": {
            "type": "StochasticSampling",
        }
    },
    "create_env_on_driver": False,
    "log_sys_usage": False,
    "normalize_actions": False
    # "compress_observations": True

})

heroId = 72
DEFAULT_CONFIG["env_config"]["observation_space"] = spaces.Tuple(
            (spaces.Discrete(9),  # final position * (if not 0 means game is over!)
             spaces.Discrete(101),  # health *
             spaces.Discrete(100),  # gold
             spaces.Discrete(11),  # level *
             spaces.Discrete(99),  # remaining EXP to level up
             spaces.Discrete(50),  # round
             spaces.Discrete(2),  # locked in
             spaces.Discrete(2),  # punish for locking in this round
             spaces.Discrete(6),  # gamePhase *
             spaces.MultiDiscrete([250, 3]),  # heroToMove: heroLocalID, isUnderlord
             spaces.Discrete(250),  # itemToMove: localID*,
             spaces.Discrete(3),  # reRoll cost
             spaces.Discrete(2),  # rerolled (item)
             spaces.Discrete(35),  # current round timer
             # below are the store heroes
             spaces.MultiDiscrete([heroId, heroId, heroId, heroId, heroId]),
             # below are the bench heroes
             spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]), spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]),
             spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]), spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]),
             spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]), spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]),
             spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]), spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]),
             # below are the board heroes
             spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]), spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]),
             spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]), spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]),
             spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]), spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]),
             spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]), spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]),
             spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]), spaces.MultiDiscrete([heroId, 250, 4, 6, 14, 9, 9, 3]),
             # below are underlords to pick (whenever valid) -> underlord ID - specialty
             spaces.MultiDiscrete([5, 3, 5, 3, 5, 3, 5, 3]),
             # below are the items
             spaces.MultiDiscrete([70, 14, 250, 4, 5]), spaces.MultiDiscrete([70, 14, 250, 4, 5]),
             spaces.MultiDiscrete([70, 14, 250, 4, 5]), spaces.MultiDiscrete([70, 14, 250, 4, 5]),
             spaces.MultiDiscrete([70, 14, 250, 4, 5]), spaces.MultiDiscrete([70, 14, 250, 4, 5]),
             spaces.MultiDiscrete([70, 14, 250, 4, 5]), spaces.MultiDiscrete([70, 14, 250, 4, 5]),
             spaces.MultiDiscrete([70, 14, 250, 4, 5]), spaces.MultiDiscrete([70, 14, 250, 4, 5]),
             spaces.MultiDiscrete([70, 14, 250, 4, 5]), spaces.MultiDiscrete([70, 14, 250, 4, 5]),
             # below are the items to pick from
             spaces.MultiDiscrete([70, 70, 70]),
             # below are dicts of other players: slot, health, gold, level, boardUnits (ID, Tier)
             spaces.MultiDiscrete(
                 [9, 101, 100, 11, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4]),
             spaces.MultiDiscrete(
                 [9, 101, 100, 11, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4]),
             spaces.MultiDiscrete(
                 [9, 101, 100, 11, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4]),
             spaces.MultiDiscrete(
                 [9, 101, 100, 11, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4]),
             spaces.MultiDiscrete(
                 [9, 101, 100, 11, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4]),
             spaces.MultiDiscrete(
                 [9, 101, 100, 11, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4]),
             spaces.MultiDiscrete(
                 [9, 101, 100, 11, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4, heroId, 4])
             ))
DEFAULT_CONFIG["env_config"]["action_space"] = spaces.MultiDiscrete([7, 9, 9])

ray.init()
trainer = PPOTrainer(config=DEFAULT_CONFIG, env=RandomEnv)

checkpoint_path = "checkpoints/"
checkpoint1 = "checkpoint_000001/checkpoint-1"
fullpath1 = checkpoint_path + checkpoint1

checkpoint2 = "checkpoint_000002/checkpoint-2"
fullpath2 = checkpoint_path + checkpoint2

sum1 = 0
sum2 = 0


if os.path.exists(fullpath1):
    print('path FOUND!')
    print("Restoring from checkpoint path", fullpath1)
    trainer.restore(fullpath1)
    temp = trainer.get_policy().model._curiosity_feature_net
    sum1 = sum(v.sum() for v in trainer.get_policy().model._curiosity_feature_net.variables().values())
else:
    print("That path does not exist!")


if os.path.exists(fullpath2):
    print('path FOUND!')
    print("Restoring from checkpoint path", fullpath2)
    trainer.restore(fullpath2)
    sum2 = sum(v.sum() for v in trainer.get_policy().model._curiosity_feature_net.variables().values())
else:
    print("That path does not exist!")

And this is the error that gets thrown:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2020.3\plugins\python-ce\helpers\pydev\pydevd.py", line 1477, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2020.3\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/denys/Documents/GitHub/Underlords/code/tester_server.py", line 198, in <module>
    sum1 = sum(v.sum() for v in trainer.get_policy().model._curiosity_feature_net.variables().values())
AttributeError: 'list' object has no attribute 'values'

I have also attached an image showing the model’s properties. If you would like, I can provide the 2 checkpoint folders for you to help me out some more (with a local setup on your end :stuck_out_tongue:).

@Denys_Ashikhin,

variables is a list, not a dictionary. Its entries are also symbolic variables, not concrete tensors. I created a colab notebook here showing how to get the weights.

Before version 1.5.0, it looks like there is a bug in the TF version where the variables of the Curiosity exploration module’s “_curiosity_feature_net” are not being updated.

But in 1.6.0 and the nightly builds they are updating.
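
For anyone finding this later, here is a minimal sketch along the lines of the notebook (assuming graph-mode TF as in the config above, where the policy exposes its session via get_session(); the actual notebook code may differ):

import numpy as np
import ray

print(ray.__version__)  # the feature-net update bug mentioned above is fixed as of 1.6.0

policy = trainer.get_policy()
sess = policy.get_session()  # the underlying tf.Session (graph-mode TF)

# variables() returns a list of symbolic tf.Variable objects, so we
# evaluate them in the policy's session to get concrete numpy arrays.
var_list = policy.model._curiosity_feature_net.variables()
weight_arrays = sess.run(var_list)

# Collapse everything into a single scalar "fingerprint" per checkpoint.
total = sum(float(np.sum(w)) for w in weight_arrays)
print(total)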

Oh wow, thanks a bunch, it definitely works now. The sums are different. I also tried swapping _curiosity_feature_net for _curiosity_forward_fcnet, but that didn’t work. However, as long as the feature net is being trained, is it safe to assume that the rest is as well?

P.S.
I really appreciate your help throughout these months; it’s given me the will to keep grappling with this project.