Changing the sampling mechanism in DQN

Hello,

I’m running DQN on a Tuple([Box(shape=(1,)), Discrete(n=2)]) action space. If I bucketise the continuous action as Discrete(n=11) (say) and I make the action space the product of both discrete spaces, it becomes Discrete(n=22). In that case, RLlib uses the Categorical TF action distribution and it runs.
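For reference, the mapping I have in mind looks roughly like this (the bucket count and the decode helper are just illustrative):

```python
import numpy as np
from gym.spaces import Box, Discrete, Tuple

N_BUCKETS = 11  # illustrative bucket count for the continuous dimension

original_space = Tuple([Box(low=-1.0, high=1.0, shape=(1,)), Discrete(2)])
flat_space = Discrete(N_BUCKETS * 2)  # 22 joint actions

def decode(flat_action):
    """Map a flat index in [0, 21] back to (continuous value, discrete action)."""
    bucket, discrete = divmod(flat_action, 2)
    low, high = original_space.spaces[0].low[0], original_space.spaces[0].high[0]
    continuous = low + bucket * (high - low) / (N_BUCKETS - 1)
    return np.array([continuous], dtype=np.float32), discrete
```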

I would now like to override the Categorical's sampling operator and basically use the (batch_size, 22) tensor to output the continuous action and the discrete action. I subclassed Categorical and used a custom action distribution that does just that (note that the policy object also needs to be subclassed so that in get_distribution_inputs_and_class we use the right distribution).
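My custom distribution looks roughly like this (the class name and the bucket decoding are simplified versions of what I actually have):

```python
import tensorflow as tf
from ray.rllib.models.tf.tf_action_dist import Categorical

N_BUCKETS = 11  # must match the discretization used to build the flat Discrete(22) space

class TupleFromCategorical(Categorical):
    """Sample a flat index in [0, 22) and decode it into (continuous, discrete)."""

    def _decode(self, flat_action):
        bucket = flat_action // 2
        discrete = flat_action % 2
        # Map the bucket index back onto [-1, 1] (my Box bounds).
        continuous = -1.0 + tf.cast(bucket, tf.float32) * (2.0 / (N_BUCKETS - 1))
        return tf.stack([continuous, tf.cast(discrete, tf.float32)], axis=-1)  # (batch_size, 2)

    def _build_sample_op(self):
        flat = tf.squeeze(tf.random.categorical(self.inputs, 1), axis=1)
        return self._decode(flat)

    def deterministic_sample(self):
        return self._decode(tf.argmax(self.inputs, axis=1))
```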

In particular, when I print the sampled tensor in my custom _build_sample_op and deterministic_sample functions, I get the expected (and desired) shape (batch_size, 2). However, the environment step function still receives an integer between 0 and 21, meaning that the custom sampling is not used.

I think it is also worth mentioning that, since RLlib first creates a fake action based on the action space to get things started, I replace the received action x in logp with tf.zeros_like(x).

Is there something I’m missing, e.g. something in the rollout worker or the sample batch?

Thanks a lot!

Hi @alainsamjr,

Two quick comments.

  1. You should not have to subclass the policy to change the action distribution. You can use the with_updates method. You can find an example in the documentation here: RLlib Concepts and Custom Algorithms — Ray v2.0.0.dev0 (rough sketch below, after point 2).

  2. The dummy batch that is used before training is not used to update the policy weights, so you should not need to worry about the logp coming from dummy samples. A bug related to that was reported many months ago, but it has since been fixed.
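Something along these lines (untested sketch; the distribution class and the inputs function are placeholders for whatever you wrote, and in practice you would reuse DQN's own Q-value computation for the distribution inputs):

```python
from ray.rllib.agents.dqn.dqn import DQNTrainer
from ray.rllib.agents.dqn.dqn_tf_policy import DQNTFPolicy

from my_dists import MyCustomDist  # hypothetical module holding your custom distribution

def my_distribution_inputs_and_class(policy, model, obs_batch, *, explore=True, **kwargs):
    # Placeholder: compute the distribution inputs the same way DQN's own
    # get_distribution_inputs_and_class does (i.e. the Q-values), but return
    # your custom distribution class instead of Categorical.
    model_out, _ = model({"obs": obs_batch}, [], None)
    return model_out, MyCustomDist, []  # (dist inputs, dist class, state-outs)

MyDQNPolicy = DQNTFPolicy.with_updates(
    name="MyDQNPolicy",
    action_distribution_fn=my_distribution_inputs_and_class,
)

MyDQNTrainer = DQNTrainer.with_updates(
    name="MyDQNTrainer",
    default_policy=MyDQNPolicy,
    get_policy_class=lambda config: MyDQNPolicy,
)
```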

If you are still having issues, post a sample script with the changes you made.



Hi @mannyv,

Thanks for your answer. Regarding your comments:

  1. I am forced to subclass the policy because DQN hard-codes the Categorical distribution, see here. EDIT: yes, I do use with_updates.
  2. The dummy policy batch does not worry me. I’m letting the episode run and I still see the one-dimensional actions.

I’m quite sure that the sampling op is not used. Am I missing something?

Thanks!

I found the issue. RLlib uses ray.rllib.utils.exploration.epsilon_greedy.EpsilonGreedy as the default exploration, and it does not take the action distribution into account (it assumes a Categorical over a Discrete space). The workaround is to subclass it and override _get_tf_exploration_action_op (similarly for torch). Then, in exploration_config, it is enough to set "type": "<path_to_my_class>".
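Roughly what I did (the override mirrors the stock EpsilonGreedy but routes the random branch through the distribution's own sample op; check the exact method signature in the epsilon_greedy.py of your Ray version):

```python
import tensorflow as tf
from ray.rllib.utils.exploration.epsilon_greedy import EpsilonGreedy

class TupleEpsilonGreedy(EpsilonGreedy):
    """Epsilon-greedy that keeps the custom distribution's (batch_size, 2) actions."""

    def _get_tf_exploration_action_op(self, action_dist, explore, timestep):
        epsilon = self.epsilon_schedule(
            timestep if timestep is not None else self.last_timestep)

        deterministic_actions = action_dist.deterministic_sample()
        # Use the distribution's own sampling op for the "random" branch instead
        # of drawing a uniform integer from the Discrete space.
        random_actions = action_dist.sample()

        batch_size = tf.shape(deterministic_actions)[0]
        chose_random = tf.random.uniform((batch_size,), 0.0, 1.0) < epsilon
        mask = tf.broadcast_to(
            tf.reshape(chose_random, (batch_size, 1)),
            tf.shape(deterministic_actions))
        stochastic_actions = tf.where(mask, random_actions, deterministic_actions)

        action = tf.cond(
            pred=tf.constant(explore) if isinstance(explore, bool) else explore,
            true_fn=lambda: stochastic_actions,
            false_fn=lambda: deterministic_actions)
        logp = tf.zeros((batch_size,), dtype=tf.float32)  # dummy log-prob
        return action, logp

# Then, in the trainer config (module path is whatever <path_to_my_class> is for you):
config = {
    "exploration_config": {
        "type": "my_exploration.TupleEpsilonGreedy",
        # the usual EpsilonGreedy kwargs (initial_epsilon, final_epsilon,
        # epsilon_timesteps) can still be passed here
    },
}
```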

@alainsamjr,

EpsilonGreedy only works with Discrete action spaces so yeah that makes sense. Glad you found a fix!

@alainsamjr,

There is another approach you could try if you wanted to at some point.

You could turn your environment into a multi-agent environment and then have two virtual agents for each agent in the environment. One agent would map to a policy that produces the discrete action and the other to a policy that produces the continuous action. Then, in the environment's step function, you would merge the two actions.

The discrete-action agent could still use DQN, but the continuous-action agent would need another policy type, since DQN only supports Discrete actions.
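A rough sketch of the idea (the wrapped env, the agent ids, and the factory function are all made up for illustration):

```python
import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class SplitActionEnv(MultiAgentEnv):
    """Expose one underlying agent as two virtual agents: one emits the
    continuous part of the action, the other the discrete part."""

    def __init__(self, env_config):
        self.env = make_underlying_env(env_config)  # hypothetical factory for your env
        self._obs = None

    def reset(self):
        self._obs = self.env.reset()
        # Both virtual agents observe the same thing.
        return {"cont_agent": self._obs, "disc_agent": self._obs}

    def step(self, action_dict):
        # Merge the two virtual agents' actions back into the original
        # Tuple([Box(shape=(1,)), Discrete(2)]) action.
        merged = (np.asarray(action_dict["cont_agent"], dtype=np.float32),
                  action_dict["disc_agent"])
        obs, rew, done, info = self.env.step(merged)
        self._obs = obs
        return (
            {"cont_agent": obs, "disc_agent": obs},
            {"cont_agent": rew, "disc_agent": rew},
            {"__all__": done},
            {"cont_agent": info, "disc_agent": info},
        )
```

You would then map disc_agent to the DQN policy and cont_agent to a policy that supports continuous actions via policy_mapping_fn in the multiagent config.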


Hi again @mannyv, that’s a nice idea, but in that case there are no shared weights between the two actions. This is a bit of an issue since they’re quite dependent on each other!

@alainsamjr

Yeah, I totally agree it is not ideal; I just wanted to offer it as an approach. You could of course have shared layers between the two policies, but that introduces new issues to deal with.

For DQN you are just stuck, because it only supports Discrete action spaces with a Categorical distribution.

But @sven1977, for PPO and other algorithms that accept both Discrete and Continuous action spaces, mixed action spaces are not currently supported. I imagine it would be possible to support them by updating the loss function so that it takes into consideration the types of the child spaces in a Tuple or Dict space and then applies the correct loss to each.