Creating saliency maps / activation maximization with trained policy

Hey everyone,

I’ve trained agents in a multi-agent environment with PPO (TensorFlow) and would like to analyze their behavior. I want to apply activation maximization and concepts similar to saliency maps from image processing. To do so, I need to calculate the gradients of the output with respect to the input. My question is whether these gradient operations can be achieved in RLlib.

I’ve already tried wrapping policy.compute_actions in a GradientTape, but this obviously failed due to the internal data conversion / serialization. I’ve also tried something along these lines:

import numpy as np

# policy is an initialized PPO policy loaded from a checkpoint
batch = policy._get_dummy_batch_from_view_requirements()

batch['obs'] = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1]).reshape(batch['obs'].shape)

aout = policy.compute_actions(batch['obs'], explore=False)
batch['actions'] = aout[0]

# Tried with and without: 
#batch = policy.postprocess_trajectory(batch)
grads = policy.compute_gradients(batch)

but every variation I’ve tried so far crashes. I’m also not sure whether all the operations I need can be done with compute_gradients.

What would your ideas be to achieve this? Is the best option (if it is possible at all) to replicate the PPO network with the checkpoint weights and then do all my “weird stuff” on this separate network, without using Ray/RLlib at all?
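In case it helps, here is a rough sketch of what I mean by replicating the network in plain numpy. The weights are random stand-ins for the checkpoint weights (in practice they would come from something like policy.get_weights()), and the layer sizes / tanh activations are assumptions based on RLlib’s default fcnet config and my 6-dim observation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in weights -- in practice these would be loaded from the
# checkpoint. The sizes (6 -> 256 -> 256 -> 2, tanh activations) are
# assumptions, not the actual checkpoint architecture.
W1, b1 = rng.normal(0, 0.1, (6, 256)), np.zeros(256)
W2, b2 = rng.normal(0, 0.1, (256, 256)), np.zeros(256)
W3, b3 = rng.normal(0, 0.1, (256, 2)), np.zeros(2)

def forward(x):
    """Forward pass; keeps hidden activations for the backward pass."""
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    return h2 @ W3 + b3, (h1, h2)  # action outputs, hidden activations

def input_gradient(x, action_idx=0):
    """d out[action_idx] / d x via manual backprop (the saliency signal)."""
    _, (h1, h2) = forward(x)
    dz2 = W3[:, action_idx] * (1.0 - h2 ** 2)  # through tanh of layer 2
    dz1 = (W2 @ dz2) * (1.0 - h1 ** 2)         # through tanh of layer 1
    return W1 @ dz1                            # back to the observation

obs = np.full(6, 0.1)
print(np.abs(input_gradient(obs)))  # one saliency value per obs dimension
```

This sidesteps Ray/RLlib entirely, at the cost of having to keep the reimplementation in sync with the actual model architecture.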

Thanks in advance!

Hey @abrandenb, could you explain what errors you see when you call policy.compute_gradients?
I don’t see any obvious flaws in your approach.

(temporary solution at the bottom)

Hi @sven1977! I’ve run the exact same code again and now it works; I assume one of my previous attempts had corrupted the RLlib instance. Nevertheless, I’m unsure what the gradients correspond to. If I run the code from the original post and visualize the data via

print(grads[1])
for i_, g_ in enumerate(grads[0]):
    print(f"{i_}: {g_.shape}")

I get following output:

{'learner_stats': {'cur_kl_coeff': 0.20000000298023224, 
'cur_lr': 4.999999873689376e-05, 'total_loss': 2005.3091, 
'policy_loss': -0.0, 'vf_loss': 2003.9865, 'vf_explained_var': -1.0, 
'kl': 6.6129913, 'entropy': 0.039684176, 
'entropy_coeff': 0.0, 'model': {}}}
0: (6, 256)
1: (256,)
2: (6, 256)
3: (256,)
4: (256, 256)
5: (256,)
6: (256, 256)
7: (256,)
8: (256, 2)
9: (2,)
10: (256, 1)
11: (1,)

To me, it seems this gradient covers not only the policy network but probably also the value network. Assuming the tensors at index i correspond to the gradients up to layer i, either grads[0][0] @ grads[0][1] or grads[0][2] @ grads[0][3] should yield a gradient of shape (6,), matching my 6-dimensional observation. However, this results in

grads[0][0] @ grads[0][1] # [1.7060623 1.7060623 1.7060623 1.7060623 1.7060624 1.7060624]
grads[0][2] @ grads[0][3] # [94850.72  94850.72  94850.72  94850.72  94850.734 94850.734]

where the gradient components are partially identical. IMO this should not be the case, so I guess I am misunderstanding the output of compute_gradients.
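If I understand correctly, compute_gradients returns the gradients of the PPO loss with respect to the model’s trainable variables, not with respect to the observation. That would explain the twelve tensors: they line up with the (kernel, bias) pairs of the layers (the two (6, 256) kernels suggesting separate policy and value input layers), so multiplying them together is not meaningful. A toy numpy illustration of the two kinds of gradient, with made-up sizes and a squared-error stand-in for the actual PPO loss:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 2))  # one toy "layer": 6-dim obs -> 2 actions
x = np.full(6, 0.1)
y = x @ W

# A stand-in squared-error "loss" in place of the PPO surrogate loss.
target = np.zeros(2)
# dL/dW -- what compute_gradients returns: one tensor per variable,
# shaped like the weights themselves.
dL_dW = np.outer(x, 2.0 * (y - target))
# d y[0] / dx -- what a saliency map needs: shaped like the observation.
dy0_dx = W[:, 0]

print(dL_dW.shape)   # (6, 2)
print(dy0_dx.shape)  # (6,)
```

So for saliency maps I would need the latter kind of gradient, which compute_gradients does not provide.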

Current (very slow) workaround:
I’ve managed to achieve what I want by accessing the TF policy model directly. However, this solution is very slow.

import numpy as np
import tensorflow as tf

batch = {
    'obs': np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1])[None]
}

sess = policy.get_session()

with policy.model.graph.as_default():
    with sess.as_default():
        with tf.GradientTape() as gtape:
            batch['obs'] = tf.constant(batch['obs'])
            gtape.watch(batch['obs'])
            a_, _ = policy.model(batch)
            g_ = gtape.gradient(a_[:,0], batch['obs'])
        print(tf.abs(g_).eval())
        # [[0.1708263  0.00814375 1.15311623 3.52741694 0.43040144 0.05739328]]

Note how the individual gradient components are now pairwise different. This only works as a temporary fix, though, as the .eval() call takes quite long; I would prefer an approach based on RLlib’s own methods. Additionally, as I see it, using policy.model directly may skip the policy’s additional non-network operations and thus lead to wrong results / crashes (e.g. if the policy post-processes the actions somehow).
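As a sanity check that is independent of the TF session, I also compare such saliency values against central finite differences. The function f below is just a stand-in scalar function, not the actual policy output:

```python
import numpy as np

def numerical_saliency(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at x; slow,
    but independent of any TF session, so it can cross-check the
    GradientTape output."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        g[i] = (f(xp) - f(xm)) / (2.0 * eps)
    return g

# Stand-in for "first action output of the policy model".
f = lambda x: 0.5 * np.tanh(x).sum()
x0 = np.full(6, 0.1)
print(np.abs(numerical_saliency(f, x0)))
```

Wrapping the model call in such a function is of course even slower than the tape, so this is only useful for spot-checking a few observations.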

I hope there is a nice solution to this, since currently my ex post evaluation takes longer than my training :joy:

Thanks!