Separate output heads for different components of action space?

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Suppose I have a complex action space, e.g. a Tuple or Dict of two discrete spaces, say Dict({"space1": Discrete(4), "space2": Discrete(6)}). I understand that this gets converted into a 10-node output layer in the model, where the first 4 nodes give the logits for space1 and the remaining 6 nodes give the logits for space2.

My question: Can I also have two separate output layers for the two spaces? For instance, could I have two 256-node fully connected layers; then a 4-node output layer that connects to the output of the second FC layer; and separately another three 64-node FC layers that also connect to the second 256-FC-layer, and in turn a 6-node output layer connecting to the final 64-node FC layer? Graphically, something like this:

                        / 4-output for space1
input - 256FC - 256FC -
                        \ 64FC - 64FC - 64FC - 6-output for space2

I assume this isn’t easily done with the built-in model catalog, right? If I want to do this in a custom model, would I just build all of this, then concatenate the 4-output and 6-output into a single 10-dim tensor and return that from model.forward()? Is there maybe even an example of something like this somewhere?

Thank you!

Hi @mgerstgrasser,

This is possible and you would do it exactly as you described. I looked at the RLlib example models and I did not see one like this. I have written a few like this but unfortunately I cannot share them here.

It should be pretty straightforward to write. If you run into issues, feel free to tag me.

I would implement this as two sequential modules, one for the [256, 256] trunk and another for the [64, 64, 64, 6] branch, plus a third layer for the 4-logit output. Then call them in forward() just like you modeled it in the diagram and add a concat operation on the output.
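Roughly, here is an untested sketch of what that could look like as a custom TFModelV2. The class name TwoHeadModel, the tanh activations, and the value branch are just illustrative choices, and it assumes a flat Box observation:

```python
import tensorflow as tf
from ray.rllib.models.tf.tf_modelv2 import TFModelV2


class TwoHeadModel(TFModelV2):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)

        inputs = tf.keras.layers.Input(shape=obs_space.shape, name="obs")

        # Shared trunk: [256, 256].
        x = tf.keras.layers.Dense(256, activation="tanh")(inputs)
        trunk_out = tf.keras.layers.Dense(256, activation="tanh")(x)

        # Head 1: 4 logits for space1, directly off the trunk.
        space1_logits = tf.keras.layers.Dense(4)(trunk_out)

        # Head 2: [64, 64, 64] branch, then 6 logits for space2.
        y = trunk_out
        for _ in range(3):
            y = tf.keras.layers.Dense(64, activation="tanh")(y)
        space2_logits = tf.keras.layers.Dense(6)(y)

        # RLlib expects a single (batch, num_outputs) logits tensor,
        # here 4 + 6 = 10 for the Dict action space above.
        logits = tf.keras.layers.Concatenate(axis=-1)([space1_logits, space2_logits])

        # Simple value branch off the trunk (needed by most algorithms).
        value = tf.keras.layers.Dense(1)(trunk_out)

        self.base_model = tf.keras.Model(inputs, [logits, value])

    def forward(self, input_dict, state, seq_lens):
        logits, self._value_out = self.base_model(input_dict["obs_flat"])
        return logits, state

    def value_function(self):
        return tf.reshape(self._value_out, [-1])
```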


Hi Manny,

Thanks! I experimented with it in the meantime as well, and it’s indeed super simple. As you suggested, I used some existing modules, basically taking ComplexInputNet, which already has a lot of the relevant code, as a starting point.

Quick follow-up question while I have your attention: the action from the second output head only has an effect in about one step out of 1000; in the other 999 steps, action2 does nothing. Whether it has an effect or not is given in the observation, so in theory the policy should be able to figure that out, but I wonder if getting feedback from those 999 irrelevant steps might still hurt learning. Would it be more efficient to somehow get rid of the gradients it gets from the other 999 steps? And if so, is there an easy way to do that? One thing I’m thinking is that I could just put a tf.stop_gradient in there, something like this:

action2_logits_layer = tf.stop_gradient((1 - action2_relevant_layer) * action2_logits_layer) + action2_relevant_layer * action2_logits_layer

where action2_relevant_layer is a tf.keras.layers.Input on that particular part of the observation (which is a Box(low=0.0, high=1.0, shape=(1,)) space, taking value 1.0 if action2 is relevant in the current step and value 0.0 otherwise). Would that work? Is there a better way of doing this?

Thank you so much!

@mgerstgrasser,

Just multiply that output head by zero on the steps where those actions are irrelevant. The derivative will be zero, and that will stop gradients from flowing back through that layer.
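A quick sanity check of that claim (sketch, TF2 eager mode):

```python
import tensorflow as tf

logits = tf.Variable([[1.0, 2.0], [3.0, 4.0]])  # per-step head outputs
mask = tf.constant([[1.0], [0.0]])              # 1.0 = relevant step, 0.0 = irrelevant

with tf.GradientTape() as tape:
    gated = mask * logits
    loss = tf.reduce_sum(gated ** 2)

print(tape.gradient(loss, logits))
# -> [[2., 4.], [0., 0.]]: no gradient reaches the head on the masked step.
```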


One more follow-up.
The way this would normally be handled in a layer that has some valid outputs and other invalid outputs is with masking. The environment would include a mask indicating which actions are valid, and the invalid ones would be masked out by setting their logits to a very large negative number. There are examples of this in the RLlib examples folder.

Another approach, used when there is usually at least one valid action but sometimes none, is to add one extra no-op action and, on those steps, mask out everything but the no-op.

You could do that in this case, but if you only really need output from that head on rare occasions, then the multiply-by-zero approach will be sufficient and easier to implement.
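For reference, the logit-masking pattern from those examples looks roughly like this (sketch; mask_logits is just an illustrative helper name):

```python
import tensorflow as tf

def mask_logits(logits, action_mask):
    # The env puts an action_mask of 0s/1s in the observation.
    # log(0) -> -inf, clipped to the most negative float, so the softmax
    # assigns (near-)zero probability to invalid actions.
    inf_mask = tf.maximum(tf.math.log(action_mask), tf.float32.min)
    return logits + inf_mask
```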


Oh, yes, duh. :person_facepalming: Of course, since I don’t want gradients and also don’t care about the output in those steps, I can just skip the stop_gradient bit. So just action2_logits_layer = action2_relevant_layer * action2_logits_layer. Much simpler - thank you!!

And yes, good point - I had looked into action masking, but I think it doesn’t really apply here, does it? My understanding was that action masking is relevant if e.g. you have an action space that’s Discrete(17), but sometimes only actions 1, 3, 9 are allowed, and sometimes actions 2, 4, 8, 12, etc. - you’d block out the invalid ones. In my case, there are never any invalid actions; it’s just that sometimes actions don’t have any effect. And crucially, of course, the action space has multiple components, and the other components still matter in those steps; otherwise we could skip those steps altogether. Really good point about the no-op action though, I could do that here. If I’m thinking about this correctly, then if action2 is just ignored in the environment in those steps, it shouldn’t make a difference though, right? (Other than perhaps for logging or similar.)

Thank you, in any case, that was super helpful!!