Hi,
I’m evaluating a DQN agent using the Ray/RLlib rollout and comparing its behavior to that of a manually stepped model. I also managed to save the `tf.keras.Model` object as an `.h5` file, and I can step it manually. The inputs are the same, but the Q-values (and hence the actions) are different. For training I use `tf`, and for the manual rollout I have `tf2` (i.e. `import tensorflow as tf` instead of `_, tf, _ = try_import_tf()`). I also tried to use `tf2` during training, but this does not solve the issue: the outputs are still different.
It is worth mentioning that:
- The actions from the RLlib rollout and my manual rollout are quite close to each other, so I haven’t ruled out a floating-point precision issue.
- I’ve been careful to take the dueling behavior into account: I use `ray.rllib.agents.dqn.dqn_tf_policy.compute_q_values()` to compute the Q-values in my model (i.e. applying the `state_score` model on top of the `q_value_head`).
- I have a custom model that subclasses `DistributionalQTFModel` and overrides `__init__()`. The flow is: `inputs` → custom embedding (also called `model_out`) → (`q_out`, `state_out`). Then I use something similar to `compute_q_values()`: `real_q_values = custom_q_values_fn(q_out, state_out, model_out)`. This has the expected shape `(batch_size, action_space.n)`.
- I save the object `tf.keras.Model(inputs, real_q_values)`.
- I take the argmax of the final Q-values (i.e. to reproduce `explore=False` from RLlib).
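For reference, here is a minimal NumPy sketch of how I understand the dueling combination and the greedy action selection I’m reproducing. The function name `dueling_q_values` is my own; the formula Q(s, a) = V(s) + (A(s, a) − meanₐ A(s, a)) is the standard mean-centered dueling head, which I’m assuming matches what `compute_q_values()` does internally:

```python
import numpy as np

def dueling_q_values(advantages, state_score):
    """Combine a dueling head's streams into Q-values.

    advantages:  shape (batch_size, n_actions) -- the q_value_head output
    state_score: shape (batch_size, 1)         -- the state_score (V) output
    """
    # Mean-center the advantages, then add the state value:
    # Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
    centered = advantages - advantages.mean(axis=1, keepdims=True)
    return state_score + centered

adv = np.array([[1.0, 2.0, 3.0]])   # toy advantages for 3 actions
v = np.array([[0.5]])               # toy state value
q = dueling_q_values(adv, v)        # shape (1, 3)

# Greedy action, i.e. what explore=False should reduce to:
action = int(np.argmax(q, axis=1)[0])
```

When comparing the two rollouts, I compare the raw Q-vectors with `np.allclose` (not just the argmaxes), since two nearly-tied Q-values can flip the argmax under float32 rounding.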
Has anyone seen a similar problem?
Thanks!