RLlib rollout vs stepping the model manually: different outcomes


I’m evaluating a DQN agent using the Ray/RLlib rollout and I compare its behavior to that of a manually stepped model.

I also managed to save the tf.keras.Model object as an .h5 file so that I could step it manually. The inputs are the same, but the Q-values (and hence the actions) are different. For training I use tf1, and for the manual rollout I use tf2 (i.e. import tensorflow as tf instead of _, tf, _ = try_import_tf()). I also tried using tf2 during training, but that does not solve the issue: the outputs are still different.

It is worth mentioning that

  1. the actions from the RLlib rollout vs my manual rollout are quite close to each other, so I haven’t ruled out a floating-point precision issue.
  2. I’ve been careful to take the dueling behavior into account. I use ray.rllib.agents.dqn.dqn_tf_policy.compute_q_values() to compute the Q-values in my model (i.e. using the state_score model on top of q_value_head).
  3. I have a custom model that subclasses DistributionalQTFModel and overrides __init__(). The flow is: inputs → custom embedding [also called model_out] → (q_out, state_out). Then I use something similar to compute_q_values(): real_q_values = custom_q_values_fn(q_out, state_out, model_out). This has the expected shape (batch_size, action_space.n).
  4. I save the object tf.keras.Model(inputs, real_q_values).
  5. I take the argmax of the final Q-values (i.e. to reproduce explore=False from RLlib).

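For context, the greedy step of my manual rollout looks roughly like this (a sketch only: greedy_action and q_fn are names I’m using here for illustration, where q_fn stands in for the saved tf.keras.Model(inputs, real_q_values)):

```python
import numpy as np

def greedy_action(q_fn, obs):
    """Greedy (explore=False) action from a Q-value function.

    q_fn is any callable mapping a (batch, obs_dim) array to
    (batch, num_actions) Q-values -- e.g. the predict method of the
    saved tf.keras.Model(inputs, real_q_values).
    """
    q_values = np.asarray(q_fn(np.asarray(obs)[None, :]))  # add batch dim
    return int(np.argmax(q_values[0]))  # argmax reproduces explore=False
```

Both rollouts take the argmax this way, so the divergence has to come from the Q-values themselves, not the action selection.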
Has anyone seen a similar problem?




I’ve managed to isolate the problem a bit better. The issue shows up only when I have both dueling: true and num_atoms: 51 (any value > 1). With either dueling: false or num_atoms: 1, both rollouts give the same action trace.
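In config terms, these are the combinations I tested (a hypothetical excerpt of the trainer config, not the full one):

```python
# Config combinations tested (hypothetical excerpt of the DQN config):
broken = {"dueling": True, "num_atoms": 51}   # rollouts diverge (any num_atoms > 1)
ok_a = {"dueling": False, "num_atoms": 51}    # action traces match
ok_b = {"dueling": True, "num_atoms": 1}      # action traces match
```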

All in all, this means the difference is introduced in the compute_q_values() function when dueling: true and num_atoms > 1:

support_logits_per_action_mean = tf.reduce_mean(
    support_logits_per_action, 1)
support_logits_per_action_centered = (
    support_logits_per_action - tf.expand_dims(
        support_logits_per_action_mean, 1))
support_logits_per_action = (
    tf.expand_dims(state_score, 1) + support_logits_per_action_centered)
support_prob_per_action = tf.nn.softmax(logits=support_logits_per_action)
value = tf.reduce_sum(input_tensor=z * support_prob_per_action, axis=-1)

This is pretty much the same code so I don’t see what’s wrong.
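To double-check the shapes, the same aggregation can be written in plain numpy outside of TF (the tensor names follow the snippet above; the random inputs and the v_min/v_max support bounds are just placeholders):

```python
import numpy as np

batch, num_actions, num_atoms = 2, 4, 51
v_min, v_max = -10.0, 10.0
z = np.linspace(v_min, v_max, num_atoms)           # distribution support

rng = np.random.default_rng(0)
support_logits_per_action = rng.normal(size=(batch, num_actions, num_atoms))
state_score = rng.normal(size=(batch, num_atoms))  # dueling state branch

# Center the per-action logits across actions (axis 1), then add the
# state score -- the dueling aggregation from the snippet above.
mean = support_logits_per_action.mean(axis=1)
centered = support_logits_per_action - mean[:, None, :]
logits = state_score[:, None, :] + centered

# Softmax over atoms (the last axis), then expected value under z.
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
value = (z * probs).sum(axis=-1)                   # shape (batch, num_actions)
```

Written this way the shapes all line up, which is why the remaining difference is so puzzling.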

After more debugging: it seems that

support_prob_per_action = tf.nn.softmax(logits=support_logits_per_action)

behaves differently in my code and in RLlib. If I replace the above line with

support_prob_per_action = tf.exp(support_logits_per_action) / tf.reduce_sum(
    tf.exp(support_logits_per_action), axis=-1, keepdims=True)

it just works (i.e. both rollouts give the same output). This is probably an axis issue, though I’m not sure why, since tf.nn.softmax has a default of axis=-1, which is exactly what my “handmade” softmax uses.

FWIW, explicitly passing axis=-1 to tf.nn.softmax() also does the trick.
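To see why a wrong softmax axis would produce Q-values that are close to, but not identical to, the correct ones, here is a small numpy sketch (the names are mine) showing that on a (batch, actions, atoms) tensor, softmax over the atoms axis and over the actions axis give different probabilities:

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
logits = rng.normal(size=(2, 4, 51))   # (batch, actions, atoms)

# Softmax over the atoms axis vs. over the actions axis: on a 3-D
# tensor these are different normalizations, so a wrong effective
# axis would shift every Q-value slightly without breaking shapes.
p_atoms = softmax(logits, axis=-1)
p_actions = softmax(logits, axis=1)
```

Both results have the same shape, which is exactly why such a bug would not crash anything and would only show up as slightly different actions.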