[RLlib] Multi-headed DQN

Hey @rfali, yeah, this should work with a custom Q-model (just sub-class the DQNTFModel and implement this logic there). You'd probably also have to change the DQN loss function, though.
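
For illustration, here is a rough sketch of what such a multi-headed Q-model could look like. Everything in it (the class name, num_heads, the get_q_values helper, and the choice of base class) is my own assumption and not an existing RLlib API; the exact model base class you sub-class may differ between RLlib versions:

    import tensorflow as tf
    from ray.rllib.models.tf.tf_modelv2 import TFModelV2


    class MultiHeadQModel(TFModelV2):
        """Q-model with one Q-value head per discount factor (gamma)."""

        def __init__(self, obs_space, action_space, num_outputs,
                     model_config, name, num_heads=3):
            super().__init__(obs_space, action_space, num_outputs,
                             model_config, name)
            self.num_heads = num_heads
            inputs = tf.keras.layers.Input(shape=obs_space.shape, name="obs")
            hidden = tf.keras.layers.Dense(256, activation="relu")(inputs)
            # One Q-head per gamma: each outputs one Q-value per discrete action.
            q_heads = [
                tf.keras.layers.Dense(action_space.n, name="q_head_{}".format(i))(hidden)
                for i in range(num_heads)
            ]
            self.base_model = tf.keras.Model(inputs, q_heads)

        def forward(self, input_dict, state, seq_lens):
            # Store all heads' Q-values for later use in the loss and action picking.
            self._q_heads = self.base_model(input_dict["obs"])
            # Return the first head's Q-values as the "main" model output.
            return self._q_heads[0], state

        def get_q_values(self):
            """Returns the list of per-head Q-value tensors (custom helper)."""
            return self._q_heads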

Just create a new DQNTFPolicy via:

MyDQNPolicy = DQNTFPolicy.with_updates(loss_fn=[your own loss function]).
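
A loss function passed in this way receives the policy, the model, the action-distribution class, and the train batch. Below is a minimal, hypothetical sketch of a multi-gamma 1-step TD loss; the GAMMAS list and the get_q_values() helper are assumptions carried over from the model sketch above, and the target network is omitted for brevity:

    import tensorflow as tf
    from ray.rllib.policy.sample_batch import SampleBatch

    GAMMAS = [0.9, 0.99, 0.999]  # one discount factor per Q-head (assumption)


    def multi_gamma_dqn_loss(policy, model, dist_class, train_batch):
        # Q-values for the observations at time t (one tensor per head).
        model({"obs": train_batch[SampleBatch.CUR_OBS]}, [], None)
        q_t_heads = model.get_q_values()
        # Q-values for the next observations (no target network here, for brevity).
        model({"obs": train_batch[SampleBatch.NEXT_OBS]}, [], None)
        q_tp1_heads = model.get_q_values()

        one_hot = tf.one_hot(
            tf.cast(train_batch[SampleBatch.ACTIONS], tf.int32),
            policy.action_space.n)
        losses = []
        for gamma, q_t, q_tp1 in zip(GAMMAS, q_t_heads, q_tp1_heads):
            q_t_selected = tf.reduce_sum(q_t * one_hot, axis=1)
            q_tp1_best = tf.reduce_max(q_tp1, axis=1)
            # Standard 1-step TD target, but with this head's own gamma.
            target = train_batch[SampleBatch.REWARDS] + gamma * (
                1.0 - tf.cast(train_batch[SampleBatch.DONES], tf.float32)
            ) * q_tp1_best
            losses.append(tf.reduce_mean(
                tf.square(q_t_selected - tf.stop_gradient(target))))
        return tf.add_n(losses)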

Gamma from the config is only used in the loss function and maybe the n-step postprocessing function, so you'd either have to set n_step=1 or also implement your own postprocessing function, like so:

MyDQNPolicy = DQNTFPolicy.with_updates(loss_fn=[your own loss function], postprocess_fn=[your own postprocessing fn doing n-step with different gammas]).
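
If you keep n_step > 1, that postprocessing function is the place to roll up the n-step return once per gamma. Here is a hypothetical sketch; the extra column names, GAMMAS, and N_STEP are all my own:

    import numpy as np

    GAMMAS = [0.9, 0.99, 0.999]  # one discount factor per Q-head (assumption)
    N_STEP = 3


    def multi_gamma_postprocess(policy, sample_batch, other_agent_batches=None,
                                episode=None):
        rewards = sample_batch["rewards"]
        dones = sample_batch["dones"]
        for head, gamma in enumerate(GAMMAS):
            nstep_rewards = rewards.astype(np.float32).copy()
            # Accumulate up to N_STEP future rewards, discounted with this head's gamma.
            for t in range(len(rewards)):
                for k in range(1, N_STEP):
                    if t + k >= len(rewards) or dones[t + k - 1]:
                        break
                    nstep_rewards[t] += gamma ** k * rewards[t + k]
            # Store one n-step reward column per head for use in the loss.
            sample_batch["rewards_gamma_{}".format(head)] = nstep_rewards
        return sample_batch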

For action choices, you may also have to specify your own action_distribution_fn:

MyDQNPolicy = DQNTFPolicy.with_updates(loss_fn=[your own loss function], postprocess_fn=[your own postprocessing fn doing n-step with different gammas], action_distribution_fn=[your own action picking and action distribution fn]).
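
Matching the docstring below, a hypothetical action_distribution_fn for the multi-headed case could pick (or combine) the heads' Q-values and hand them to a Categorical distribution. The head-selection rule here is just a placeholder, and the exact argument list may include a few more positional/keyword args depending on the RLlib version:

    from ray.rllib.models.tf.tf_action_dist import Categorical
    from ray.rllib.policy.sample_batch import SampleBatch


    def multi_gamma_action_distribution_fn(policy, model, obs_batch,
                                           *args, **kwargs):
        # Run the model forward pass to populate the per-head Q-values.
        model({SampleBatch.CUR_OBS: obs_batch}, [], None)
        q_heads = model.get_q_values()  # custom helper from the model sketch
        # Placeholder head selection: act greedily w.r.t. the last (e.g.
        # longest-horizon) head's Q-values.
        dist_inputs = q_heads[-1]
        # Return (distribution inputs, dist class, internal-state outputs).
        return dist_inputs, Categorical, []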

Here is the docstring:

        action_distribution_fn (Optional[Callable[[Policy, ModelV2, TensorType,
            TensorType, TensorType],
            Tuple[TensorType, type, List[TensorType]]]]): Optional callable
            returning distribution inputs (parameters), a dist-class to
            generate an action distribution object from, and internal-state
            outputs (or an empty list if not applicable). If None, will either
            use `action_sampler_fn` or compute actions by calling self.model,
            then sampling from the so parameterized action distribution.