RLlib rollout vs stepping the model manually: different outcomes

Hi,

I’m evaluating a DQN agent with the Ray/RLlib rollout script and comparing its behavior to that of a manually stepped model.

I also managed to save the tf.keras.Model object as an .h5 file and step it manually. The inputs are the same, but the Q-values (and hence the actions) are different. For training I use tf, and for the manual rollout I use tf2 (i.e. import tensorflow as tf instead of _, tf, _ = try_import_tf()). I also tried using tf2 during training, but that does not solve the issue: the outputs are still different.
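
For reference, the manual rollout is essentially the following (a minimal sketch: the file name and environment are placeholders, and I’m assuming the saved model maps a batch of observations straight to per-action Q-values; a model with custom layers would additionally need custom_objects when loading):

import numpy as np
import gym
import tensorflow as tf  # plain tf2 import, no try_import_tf()

model = tf.keras.models.load_model("dqn_q_model.h5")  # placeholder path
env = gym.make("CartPole-v0")                         # placeholder env

obs = env.reset()
done = False
while not done:
    q_values = model.predict(obs[None, ...])  # shape (1, n_actions)
    action = int(np.argmax(q_values[0]))      # greedy action, like explore=False
    obs, reward, done, _ = env.step(action)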

It is worth mentioning that

  1. the actions from the RLlib rollout vs. my manual rollout are quite close to each other, so I haven’t ruled out a numerical-precision issue.
  2. I’ve been careful enough to take the dueling behavior into account. I use ray.rllib.agents.dqn.dqn_tf_policy.compute_q_values() to compute the Q-values in my model (i.e. using the state_score model on top of q_value_head).
  3. I have a custom model that subclasses DistributionalQTFModel and overrides __init__(). The flow is: inputs → custom embedding [also called model_out] → (q_out, state_out). Then I use something similar to compute_q_values(): real_q_values = custom_q_values_fn(q_out, state_out, model_out). This has the expected shape (batch_size, action_space.n).
  4. I save the object tf.keras.Model(inputs, real_q_values).
  5. I take the argmax of the final Q-values (i.e. to reproduce explore=False from RLlib); see the sketch right after this list.

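To make points 3–5 concrete, here is roughly what that combination step looks like, written as a standalone eager-mode function (a sketch only: dueling_distributional_q_values stands in for my custom_q_values_fn, the shapes and v_min/v_max are placeholders, I’ve dropped the model_out argument since it isn’t needed for the combination itself, and I’m assuming q_out is already reshaped to (batch, n_actions, num_atoms) and state_out to (batch, num_atoms)):

import numpy as np
import tensorflow as tf

def dueling_distributional_q_values(q_out, state_out, v_min=-10.0, v_max=10.0):
    # q_out:     (batch, n_actions, num_atoms) support logits per action
    # state_out: (batch, num_atoms) output of the dueling state-score head
    num_atoms = q_out.shape[-1]
    z = tf.constant(np.linspace(v_min, v_max, num_atoms), dtype=tf.float32)
    centered = q_out - tf.reduce_mean(q_out, axis=1, keepdims=True)  # center advantages over actions
    logits = tf.expand_dims(state_out, 1) + centered                 # add the state score back in
    probs = tf.nn.softmax(logits, axis=-1)                           # distribution over the support
    return tf.reduce_sum(z * probs, axis=-1)                         # (batch, n_actions) Q-values

# Quick shape check with random heads; the argmax reproduces explore=False.
q_values = dueling_distributional_q_values(
    tf.random.normal((2, 4, 51)), tf.random.normal((2, 51)))
greedy_actions = tf.argmax(q_values, axis=-1)
print(q_values.shape, greedy_actions.numpy())
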
Has anyone seen a similar problem?

Thanks!

Update

I managed to isolate the problem a bit better. The issue shows up when I have both dueling: true and num_atoms: 51 (any value > 1). With either dueling: false or num_atoms: 1, both rollouts give the same action trace.
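
For reference, the relevant part of the trainer config is just these two switches (sketch only; everything else stays at its defaults, and 51 is simply the value I happened to use):

config = {
    "dueling": True,    # dueling head enabled
    "num_atoms": 51,    # distributional DQN; any value > 1 triggers the mismatch
}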

All in all, it means that the difference is introduced in the compute_q_values() function when dueling: true and num_atoms > 1, specifically in this block:

# Dueling + distributional branch of compute_q_values();
# support_logits_per_action has shape (batch, n_actions, num_atoms),
# state_score is the dueling state head and z is the distribution support.
support_logits_per_action_mean = tf.reduce_mean(support_logits_per_action, 1)
support_logits_per_action_centered = (
    support_logits_per_action
    - tf.expand_dims(support_logits_per_action_mean, 1)
)
support_logits_per_action = (
    tf.expand_dims(state_score, 1) + support_logits_per_action_centered
)
support_prob_per_action = tf.nn.softmax(logits=support_logits_per_action)
value = tf.reduce_sum(input_tensor=z * support_prob_per_action, axis=-1)

This is pretty much the same code as mine, so I don’t see what’s wrong.

After more debugging, it seems that

support_prob_per_action = tf.nn.softmax(logits=support_logits_per_action)

behaves differently in my code and in RLlib. If I replace the above line with

support_prob_per_action = tf.exp(support_logits_per_action) / tf.reduce_sum(
    tf.exp(support_logits_per_action), axis=-1, keepdims=True
)

it just works (i.e. both rollouts give the same output). This is probably an axis issue, but I’m not sure why, since tf.nn.softmax defaults to the last axis, which is exactly what my “handmade” softmax uses.

FWIW, explicitly passing axis=-1 to tf.nn.softmax() also does the trick.
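
For what it’s worth, in eager tf2 the two formulations agree to float precision on the same tensor, which makes the behavior difference even more puzzling. Quick sanity check (random logits, placeholder shapes):

import numpy as np
import tensorflow as tf

logits = tf.random.normal((4, 2, 51))  # (batch, n_actions, num_atoms), as in compute_q_values()

builtin = tf.nn.softmax(logits, axis=-1)
manual = tf.exp(logits) / tf.reduce_sum(tf.exp(logits), axis=-1, keepdims=True)

print(np.max(np.abs(builtin.numpy() - manual.numpy())))  # on the order of 1e-7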