Separate output heads for different components of action space?

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Suppose I have a complex action space, e.g. a Tuple or Dict of two discrete spaces, say Dict({"space1": Discrete(4), "space2": Discrete(6)}). I understand that this gets converted into a 10-node output layer in the model, where the first 4 nodes give the logits for space1 and the remaining 6 nodes give the logits for space2.

My question: Can I also have two separate output layers for the two spaces? For instance, could I have two 256-node fully connected layers; then a 4-node output layer that connects to the output of the second FC layer; and separately another three 64-node FC layers that also connect to the second 256-FC-layer, and in turn a 6-node output layer connecting to the final 64-node FC layer? Graphically, something like this:

                        / 4-output for space1
input - 256FC - 256FC -
                        \ 64FC - 64FC - 64FC - 6-output for space2

I assume this isn’t easily done with the built-in model catalog, right? If I want to do this in a custom model, would I just build all of this, then concatenate the 4-output and 6-output into a single 10-dim tensor and return that from model.forward()? Is there maybe even an example of something like this somewhere?

Thank you!

Hi @mgerstgrasser,

This is possible and you would do it exactly as you described. I looked at the RLlib example models and I did not see one like this. I have written a few like this but unfortunately I cannot share them here.

It should be pretty straightforward to write. If you run into issues, feel free to tag me.

I would implement this as two sequential modules, one for the [256, 256] trunk and another for the [64, 64, 64, 6] branch, plus a third layer for the 4-logit output. Then call them in forward() just like you modeled it in the diagram and add a concat operation on the output.
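Roughly, here is an untested sketch of what that could look like as a custom TFModelV2. The class name TwoHeadModel, the tanh activations, and the value branch are just illustrative choices, and it assumes a flat Box observation:

```python
import tensorflow as tf
from ray.rllib.models.tf.tf_modelv2 import TFModelV2


class TwoHeadModel(TFModelV2):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)

        inputs = tf.keras.layers.Input(shape=obs_space.shape, name="obs")

        # Shared trunk: [256, 256].
        x = tf.keras.layers.Dense(256, activation="tanh")(inputs)
        trunk_out = tf.keras.layers.Dense(256, activation="tanh")(x)

        # Head 1: 4 logits for space1, directly off the trunk.
        space1_logits = tf.keras.layers.Dense(4)(trunk_out)

        # Head 2: [64, 64, 64] branch, then 6 logits for space2.
        y = trunk_out
        for _ in range(3):
            y = tf.keras.layers.Dense(64, activation="tanh")(y)
        space2_logits = tf.keras.layers.Dense(6)(y)

        # RLlib expects a single (batch, num_outputs) logits tensor,
        # here 4 + 6 = 10 for the Dict action space above.
        logits = tf.keras.layers.Concatenate(axis=-1)([space1_logits, space2_logits])

        # Simple value branch off the trunk (needed by most algorithms).
        value = tf.keras.layers.Dense(1)(trunk_out)

        self.base_model = tf.keras.Model(inputs, [logits, value])

    def forward(self, input_dict, state, seq_lens):
        logits, self._value_out = self.base_model(input_dict["obs_flat"])
        return logits, state

    def value_function(self):
        return tf.reshape(self._value_out, [-1])
```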


Hi Manny,

Thanks! I experimented with it in the meantime as well, and it’s indeed super simple. As you suggested, I used some existing modules, basically taking ComplexInputNet, which already has a lot of the relevant code, as a starting point.

Quick follow-up question while I have your attention: the action from the second output head only has an effect in about one step out of 1000; in the other 999 steps, action2 does nothing. Whether it has an effect or not is given in the observation, so in theory the policy should be able to figure that out, but I wonder if getting feedback from those 999 irrelevant steps might still hurt learning. Would it be more efficient to somehow get rid of the gradients it gets from the other 999 steps? And if so, is there an easy way to do that? One thing I’m thinking is that I could just put a tf.stop_gradient in there, something like this:

action2_logits_layer = tf.stop_gradient((1 - action2_relevant_layer) * action2_logits_layer) + action2_relevant_layer * action2_logits_layer

where action2_relevant_layer is a tf.keras.layers.Input on that particular part of the observation (which is a Box(low=0.0, high=1.0, shape=(1,)) space, taking value 1.0 if action2 is relevant in the current step and value 0.0 otherwise). Would that work? Is there a better way of doing this?

Thank you so much!

@mgerstgrasser,

Just multiply that output head by zero on the steps where those actions are irrelevant. The derivative will be zero, and that will stop gradients from flowing back through that layer.
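A quick sanity check of that claim (sketch, TF2 eager mode):

```python
import tensorflow as tf

logits = tf.Variable([[1.0, 2.0], [3.0, 4.0]])  # per-step head outputs
mask = tf.constant([[1.0], [0.0]])              # 1.0 = relevant step, 0.0 = irrelevant

with tf.GradientTape() as tape:
    gated = mask * logits
    loss = tf.reduce_sum(gated ** 2)

print(tape.gradient(loss, logits))
# -> [[2., 4.], [0., 0.]]: no gradient reaches the head on the masked step.
```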


One more follow-up.
The way this would normally be handled in a layer that has some valid outputs and other invalid outputs is with masking. The environment would include a mask indicating which actions are valid, and the invalid ones would be masked out by setting their logits to a very large negative number. There are examples of this in the RLlib examples folder.

Another approach, used when there is usually at least one valid action but sometimes none, is to add one extra no-op action and, on those steps, mask out everything but the no-op.

You could do that in this case, but if you only really need output from that head on rare occasions, then the multiply-by-zero approach will be sufficient and easier to implement.
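For reference, the logit-masking pattern from those examples looks roughly like this (sketch; mask_logits is just an illustrative helper name):

```python
import tensorflow as tf

def mask_logits(logits, action_mask):
    # The env puts an action_mask of 0s/1s in the observation.
    # log(0) -> -inf, clipped to the most negative float, so the softmax
    # assigns (near-)zero probability to invalid actions.
    inf_mask = tf.maximum(tf.math.log(action_mask), tf.float32.min)
    return logits + inf_mask
```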


Oh, yes, duh. :person_facepalming: Of course, since I don’t want gradients and also don’t care about the output in those steps, I can just skip the stop_gradient bit. So just action2_logits_layer = action2_relevant_layer * action2_logits_layer. Much simpler - thank you!!

And yes, good point - I had looked into action masking, but I think it doesn’t really apply here, does it? My understanding was that action masking is relevant if e.g. you have an action space that’s Discrete(17), but sometimes only actions 1, 3, 9 are allowed, and sometimes actions 2, 4, 8, 12, etc. - you’d block out the invalid ones. In my case, there are never any invalid actions; it’s just that sometimes actions don’t have any effect. And crucially, of course, the action space has multiple components, and the other components still matter in those steps; otherwise we could skip those steps altogether. Really good point about the no-op action though, I could do that here. If I’m thinking about this correctly, then if action2 is just ignored in the environment in those steps, it shouldn’t make a difference though, right? (Other than perhaps for logging or similar.)

Thank you, in any case, that was super helpful!!