I’m evaluating a DQN agent using the Ray/RLlib rollout and comparing its behavior to that of a manually stepped model. I also managed to save the `tf.keras.Model` object as an h5 file, and I can step it manually. The inputs are the same, but the Q-values (and hence the actions) are different. For training I use `tf`, and for the manual rollout I have `import tensorflow as tf` instead of `_, tf, _ = try_import_tf()`. I also tried using `tf2` during training, but that does not solve the issue: the outputs are still different.
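To get a feel for whether the gap is just float noise or a real mismatch, I compare the two Q-value arrays numerically. A minimal sketch with made-up numbers; `q_rllib` and `q_manual` are stand-ins for the actual outputs of the two rollouts:

```python
import numpy as np

# Stand-ins for the Q-values from the RLlib rollout and the manual rollout.
# In practice these come from the policy's forward pass and from calling
# the reloaded keras model on the same observation.
q_rllib = np.array([0.1234567, -0.9876543, 0.5555555], dtype=np.float32)
q_manual = np.array([0.1234569, -0.9876540, 0.5555551], dtype=np.float32)

max_abs_diff = np.max(np.abs(q_rllib - q_manual))
print(f"max |dQ| = {max_abs_diff:.2e}")

# float32 round-off across two different execution paths is typically ~1e-6;
# anything much larger suggests a real modelling difference, not precision.
if np.allclose(q_rllib, q_manual, atol=1e-5):
    print("difference is consistent with float32 precision")
else:
    print("difference is too large for precision alone")

# Even tiny Q-value gaps can flip the argmax when two actions are near-tied.
print("same greedy action:", np.argmax(q_rllib) == np.argmax(q_manual))
```

If the max absolute difference is around 1e-6 but the chosen actions still diverge occasionally, that points at near-tied Q-values rather than a wrong model.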
It is worth mentioning that:
- the actions from the RLlib rollout vs. my manual rollout are quite close to each other, so I haven’t ruled out floating-point precision as the explanation.
- I’ve been careful to take the dueling behavior into account: I use `ray.rllib.agents.dqn.dqn_tf_policy.compute_q_values()` to compute the Q-values in my model (i.e. using the `state_score` model on top of `model_out`, as in the dueling setup).
- I have a custom model that subclasses the RLlib TF model class and builds its layers in `__init__()`. The flow is: `inputs` → custom embedding (also called `model_out`) → (`q_out`, `state_out`). Then I use something similar to `real_q_values = custom_q_values_fn(q_out, state_out, model_out)`, which has the expected shape.
- I save the model object to the h5 file mentioned above and reload it for the manual rollout.
- I take the argmax of the final Q-values (i.e. to reproduce deterministic, greedy action selection).
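For reference, the dueling combination I’m reproducing by hand looks roughly like the sketch below. This is a NumPy stand-in, not RLlib code: the shapes and variable names mirror my description above, the numbers are random, and `custom_q_values_fn` is my own helper, assuming the usual mean-centred dueling formula Q(s, a) = V(s) + A(s, a) − mean_a A(s, a):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_actions, embed_dim = 2, 4, 8

# Stand-in for the custom embedding: inputs -> model_out.
model_out = rng.normal(size=(batch, embed_dim)).astype(np.float32)

# Stand-ins for the two heads applied to model_out: q_out plays the role of
# the per-action (advantage) stream, state_out the scalar state-score stream.
q_out = rng.normal(size=(batch, num_actions)).astype(np.float32)
state_out = rng.normal(size=(batch, 1)).astype(np.float32)

def custom_q_values_fn(q_out, state_out, model_out):
    """Dueling combination with mean-centred advantages:
    Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
    model_out is accepted for signature parity but unused here."""
    advantages_centered = q_out - q_out.mean(axis=1, keepdims=True)
    return state_out + advantages_centered

real_q_values = custom_q_values_fn(q_out, state_out, model_out)
assert real_q_values.shape == (batch, num_actions)  # the expected shape

# Greedy action selection, as in the evaluation rollout.
actions = np.argmax(real_q_values, axis=1)
print(actions)
```

Note that adding `state_out` and subtracting the advantage mean are constant per state, so they never change the argmax; if my manual actions still differ, the discrepancy must already be present in the embedding or the head outputs, not in this combination step.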
Has anyone seen a similar problem?