Cannot understand how to create custom model for DQN

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I am struggling to understand the proper way to use a custom model with the DQN algorithm. The steps I’m taking are:

1. Create a class MyModel that extends TFModelV2.

from typing import Dict

import tensorflow as tf
from gym.spaces import Dict as GymDict, Discrete
from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
from ray.rllib.models.tf.tf_modelv2 import TFModelV2


class MyModel(TFModelV2):
    def __init__(self, obs_space: GymDict, act_space: Discrete, num_outputs: int,
                 model_config: Dict, name: str):
        super().__init__(obs_space, act_space, num_outputs, model_config, name)
        self.internal_model = FullyConnectedNetwork(
            obs_space, act_space, num_outputs, model_config, name + '_internal')
        self.final_layer = tf.keras.layers.Dense(
            act_space.n, name='q_values', activation=None)

    def forward(self, input_dict, state, seq_lens):
        logits, _ = self.internal_model({'obs': input_dict['obs_flat']})
        q_values = self.final_layer(logits)
        self._value = tf.math.reduce_max(q_values, axis=1)
        return q_values, state

    def value_function(self):
        return self._value
  • How is this handled by the DistributionalQTFModel class? I see that RLlib uses it for the policy even with the default configuration num_atoms: 1, but I don’t really understand how/where my model interacts with it.
  • The num_outputs parameter received in the constructor is wrong: it refers to the size of the last hidden layer, not to the number of Q values. This breaks the policy, since it expects a different number of outputs (see stack trace below). How does this work? Where and how is the policy created, and why does it use the size of the last hidden layer even when no_final_linear is True?
  2. Register the model:
ModelCatalog.register_custom_model("my_model", MyModel)
  3. Set the model in the configuration:
'dueling': False,  # No dueling, so no need for a separate value branch

'model': {
    'custom_model': 'my_model',
    # Even if I set this to False, my model receives True
    # in model_config['no_final_linear']. Why?
    'no_final_linear': True,
    'fcnet_hiddens': [1024, 1024],
},

Stack trace of policy trying to use the wrong number of outputs:

(pid=7547)   File "/home/fedetask/Desktop/vtl/venv/lib/python3.9/site-packages/ray/rllib/evaluation/", line 584, in __init__
(pid=7547)     self._build_policy_map(
(pid=7547)   File "/home/fedetask/Desktop/vtl/venv/lib/python3.9/site-packages/ray/rllib/evaluation/", line 1384, in _build_policy_map
(pid=7547)     self.policy_map.create_policy(name, orig_cls, obs_space, act_space,
(pid=7547)   File "/home/fedetask/Desktop/vtl/venv/lib/python3.9/site-packages/ray/rllib/policy/", line 133, in create_policy
(pid=7547)     self[policy_id] = class_(
(pid=7547)   File "/home/fedetask/Desktop/vtl/venv/lib/python3.9/site-packages/ray/rllib/policy/", line 238, in __init__
(pid=7547)     DynamicTFPolicy.__init__(
(pid=7547)   File "/home/fedetask/Desktop/vtl/venv/lib/python3.9/site-packages/ray/rllib/policy/", line 295, in __init__
(pid=7547)     action_distribution_fn(
(pid=7547)   File "/home/fedetask/Desktop/vtl/venv/lib/python3.9/site-packages/ray/rllib/agents/dqn/", line 219, in get_distribution_inputs_and_class
(pid=7547)     q_vals = compute_q_values(
(pid=7547)   File "/home/fedetask/Desktop/vtl/venv/lib/python3.9/site-packages/ray/rllib/agents/dqn/", line 352, in compute_q_values
(pid=7547)     dist) = model.get_q_value_distributions(model_out)
(pid=7547)   File "/home/fedetask/Desktop/vtl/venv/lib/python3.9/site-packages/ray/rllib/agents/dqn/", line 184, in get_q_value_distributions
(pid=7547)     return self.q_value_head(model_out)
(pid=7547)   File "/home/fedetask/Desktop/vtl/venv/lib/python3.9/site-packages/keras/engine/", line 739, in __call__
(pid=7547)     input_spec.assert_input_compatibility(self.input_spec, inputs,
(pid=7547)   File "/home/fedetask/Desktop/vtl/venv/lib/python3.9/site-packages/keras/engine/", line 263, in assert_input_compatibility
(pid=7547)     raise ValueError(f'Input {input_index} of layer "{layer_name}" is '
(pid=7547) ValueError: Input 0 of layer "model_1" is incompatible with the layer: expected shape=(None, 1024), found shape=(None, 3)

It might help: If in the model config I set

'fcnet_hiddens': [1024, 1024, 3]  # 3 is the number of actions for an agent

and directly use self.internal_model to compute the Q values (i.e., removing self.final_layer), things work. But I cannot do this, since I will have several agents, each with a different number of actions.

I finally understood the issue: DQN always adds a linear layer on top of the forward() output. This means that forward() must not return the Q values, but just an embedding that will be transformed into Q values by RLlib.
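That behavior can be pictured with a plain-Python sketch (hypothetical names, no RLlib imports): the head that DQN wraps around the model is sized for the embedding width, not the number of actions, which is exactly the shape mismatch in the stack trace above.

```python
class QValueHead:
    """Stand-in for the linear head DQN adds on top of forward():
    it expects an embedding of a fixed width, not Q values."""

    def __init__(self, expected_width: int, num_actions: int):
        self.expected_width = expected_width
        self.num_actions = num_actions

    def __call__(self, model_out: list) -> list:
        if len(model_out) != self.expected_width:
            raise ValueError(
                f"expected shape=(None, {self.expected_width}), "
                f"found shape=(None, {len(model_out)})")
        return [0.0] * self.num_actions  # pretend Q values


# DQN sizes the head from the last hidden layer (1024 here)...
head = QValueHead(expected_width=1024, num_actions=3)

# ...so if forward() already returns 3 Q values, the head rejects them:
try:
    head([0.1, 0.2, 0.3])
except ValueError as e:
    print(e)  # expected shape=(None, 1024), found shape=(None, 3)

# Returning the 1024-wide embedding instead is accepted:
q_values = head([0.0] * 1024)
```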

How can I disable this behavior and handle the Q values directly in the custom model’s forward()? This is crucial for action masking (and in fact the RLlib action-masking example does not work with DQN).
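For context, the masking trick used by the RLlib action-masking example is roughly the following (a minimal sketch with made-up values, no TF): invalid actions get a huge negative logit, which only works if nothing is applied on top of the masked output. Any extra linear layer, like the one DQN adds, mixes the mask constant into every output and destroys the masking.

```python
FLOAT_MIN = -3.4e38  # stands in for tf.float32.min in the RLlib example

def mask_logits(logits, action_mask):
    # Invalid actions (mask == 0) get a huge negative logit, so an
    # argmax (or softmax) over the result can never pick them.
    return [l if m else FLOAT_MIN for l, m in zip(logits, action_mask)]

masked = mask_logits([0.2, 1.5, -0.3], [1, 0, 1])
best = max(range(len(masked)), key=lambda i: masked[i])  # never action 1
```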