Creating saliency maps / activation maximization with trained policy

Hey everyone,

I’ve trained agents in a multi-agent environment with PPO (TensorFlow) and would like to analyze their behavior. I want to apply activation maximization and concepts similar to saliency maps from image processing. To do so, I need to calculate the gradients of the output with respect to the input. My question is whether these gradient operations can be achieved in RLlib.

I’ve already tried wrapping policy.compute_actions in a GradientTape, but this obviously failed due to the internal data conversion / serialization. I’ve also tried something along these lines:

import numpy as np

# policy is an initialized PPO policy loaded from a checkpoint
batch = policy._get_dummy_batch_from_view_requirements()

batch['obs'] = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1]).reshape(batch['obs'].shape)

aout = policy.compute_actions(batch['obs'], explore=False)
batch['actions'] = aout[0]

# Tried with and without: 
#batch = policy.postprocess_trajectory(batch)
grads = policy.compute_gradients(batch)

but every variation I’ve tried so far crashes. I’m also not sure whether all the operations I need can be done with compute_gradients.

What would your ideas be to achieve this? Is the best option (if it is possible at all) to replicate the PPO network with the checkpoint weights and then do all my “weird stuff” on this separate network, without using Ray/RLlib at all?
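In case it helps, here is a rough sketch of what I mean by replicating the network in plain numpy. The weights are random stand-ins for the checkpoint weights (in practice they would come from something like policy.get_weights()), and the layer sizes / tanh activations are assumptions based on RLlib’s default fcnet config and my 6-dim observation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in weights -- in practice these would be loaded from the
# checkpoint. The sizes (6 -> 256 -> 256 -> 2, tanh activations) are
# assumptions, not the actual checkpoint architecture.
W1, b1 = rng.normal(0, 0.1, (6, 256)), np.zeros(256)
W2, b2 = rng.normal(0, 0.1, (256, 256)), np.zeros(256)
W3, b3 = rng.normal(0, 0.1, (256, 2)), np.zeros(2)

def forward(x):
    """Forward pass; keeps hidden activations for the backward pass."""
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    return h2 @ W3 + b3, (h1, h2)  # action outputs, hidden activations

def input_gradient(x, action_idx=0):
    """d out[action_idx] / d x via manual backprop (the saliency signal)."""
    _, (h1, h2) = forward(x)
    dz2 = W3[:, action_idx] * (1.0 - h2 ** 2)  # through tanh of layer 2
    dz1 = (W2 @ dz2) * (1.0 - h1 ** 2)         # through tanh of layer 1
    return W1 @ dz1                            # back to the observation

obs = np.full(6, 0.1)
print(np.abs(input_gradient(obs)))  # one saliency value per obs dimension
```

This sidesteps Ray/RLlib entirely, at the cost of having to keep the reimplementation in sync with the actual model architecture.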

Thanks in advance!

Hey @abrandenb, could you explain what errors you see when you call policy.compute_gradients?
I don’t see any obvious flaws in your approach.

(temporary solution at the bottom)

Hi @sven1977! I’ve run the exact same code again and now it works; I assume one of my previous attempts had corrupted the RLlib instance. Nevertheless, I’m unsure what the gradients correspond to. If I run the code from the original post and visualize the data via

print(grads[1])
for i_, g_ in enumerate(grads[0]):
    print(f"{i_}: {g_.shape}")

I get following output:

{'learner_stats': {'cur_kl_coeff': 0.20000000298023224, 
'cur_lr': 4.999999873689376e-05, 'total_loss': 2005.3091, 
'policy_loss': -0.0, 'vf_loss': 2003.9865, 'vf_explained_var': -1.0, 
'kl': 6.6129913, 'entropy': 0.039684176, 
'entropy_coeff': 0.0, 'model': {}}}
0: (6, 256)
1: (256,)
2: (6, 256)
3: (256,)
4: (256, 256)
5: (256,)
6: (256, 256)
7: (256,)
8: (256, 2)
9: (2,)
10: (256, 1)
11: (1,)

To me, it seems this gradient covers not only the policy network but probably also the value network. Assuming the tensors at index i correspond to the gradients up to layer i, either grads[0][0] @ grads[0][1] or grads[0][2] @ grads[0][3] should yield a gradient of shape (6,), matching my 6-dimensional observation. However, this results in

grads[0][0] @ grads[0][1] # [1.7060623 1.7060623 1.7060623 1.7060623 1.7060624 1.7060624]
grads[0][2] @ grads[0][3] # [94850.72  94850.72  94850.72  94850.72  94850.734 94850.734]

where the gradient components are partially identical. IMO this should not be the case, so I guess I am misunderstanding the output of compute_gradients.
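If I understand correctly, compute_gradients returns the gradients of the PPO loss with respect to the model’s trainable variables, not with respect to the observation. That would explain the twelve tensors: they line up with the (kernel, bias) pairs of the layers (the two (6, 256) kernels suggesting separate policy and value input layers), so multiplying them together is not meaningful. A toy numpy illustration of the two kinds of gradient, with made-up sizes and a squared-error stand-in for the actual PPO loss:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 2))  # one toy "layer": 6-dim obs -> 2 actions
x = np.full(6, 0.1)
y = x @ W

# A stand-in squared-error "loss" in place of the PPO surrogate loss.
target = np.zeros(2)
# dL/dW -- what compute_gradients returns: one tensor per variable,
# shaped like the weights themselves.
dL_dW = np.outer(x, 2.0 * (y - target))
# d y[0] / dx -- what a saliency map needs: shaped like the observation.
dy0_dx = W[:, 0]

print(dL_dW.shape)   # (6, 2)
print(dy0_dx.shape)  # (6,)
```

So for saliency maps I would need the latter kind of gradient, which compute_gradients does not provide.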

Current (very slow) workaround:
I’ve managed to achieve what I want by accessing the TF policy model directly. However, this solution is very slow.

import numpy as np
import tensorflow as tf

batch = {
    'obs': np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1])[None]
}

sess = policy.get_session()

with policy.model.graph.as_default():
    with sess.as_default():
        with tf.GradientTape() as gtape:
            batch['obs'] = tf.constant(batch['obs'])
            gtape.watch(batch['obs'])
            a_, _ = policy.model(batch)
            g_ = gtape.gradient(a_[:,0], batch['obs'])
        print(tf.abs(g_).eval())
        # [[0.1708263  0.00814375 1.15311623 3.52741694 0.43040144 0.05739328]]

Note how the individual gradient components are now pairwise different. This only works as a temporary fix, though, as the .eval() call takes quite long; I would prefer an approach based on RLlib’s own methods. Additionally, as I see it, using policy.model directly may skip the policy’s additional non-network operations and thus lead to wrong results / crashes (e.g. if the policy post-processes the actions somehow).
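As a sanity check that is independent of the TF session, I also compare such saliency values against central finite differences. The function f below is just a stand-in scalar function, not the actual policy output:

```python
import numpy as np

def numerical_saliency(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at x; slow,
    but independent of any TF session, so it can cross-check the
    GradientTape output."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        g[i] = (f(xp) - f(xm)) / (2.0 * eps)
    return g

# Stand-in for "first action output of the policy model".
f = lambda x: 0.5 * np.tanh(x).sum()
x0 = np.full(6, 0.1)
print(np.abs(numerical_saliency(f, x0)))
```

Wrapping the model call in such a function is of course even slower than the tape, so this is only useful for spot-checking a few observations.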

I hope there is a nice solution to this, since currently my ex post evaluation takes longer than my training :joy:

Thanks!