Action space with multiple outputs?

I would like to use an environment where the agent has to choose its action as a position in the table below:

I see two options for how to do it:

  1. Each position has its own number (left image): the number of output neurons equals rows * columns, in this example 30 neurons in the output layer. This is the typical RLlib solution.
  2. The agent returns a column number and a row number (right image) as its action: the number of output neurons equals rows + columns, in this example 11 neurons in the output layer.

Theoretically the second solution (right image) is more efficient (especially for very big tables), but there are some problems with it:

  1. How to define an agent with two outputs (the column number and the row number in this example)? And how to use it?
  2. How to use action masking in such a case if the masked fields (grey) cover only part of a row and part of a column (only the green fields are available to the agent)?

I would be grateful for any suggestions or examples, even if they are not a complete solution to the problem. If this is currently not possible in RLlib, that information would be important for me too.


You can use Dict action spaces or Tuple action spaces.
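For example, a minimal sketch using the standard gym spaces (the 6x5 grid dimensions are just taken from the question):

from gym.spaces import Dict, Discrete, Tuple

# (row, column) as a Tuple action space
tuple_space = Tuple((Discrete(6), Discrete(5)))

# the same idea with labeled channels
dict_space = Dict({"row": Discrete(6), "col": Discrete(5)})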


I'm not sure if it is possible to somehow mask a Tuple action space at a finer granularity than a whole row or a whole column, but thank you for the suggestion.

I'm not sure if I fully understand your example. In the right table, how is it that we end up with an output dimension of 11?

Nonetheless, you would probably need to implement some sort of masking procedure yourself via a custom policy.

That being said, I’m not sure if this is the right approach to take for your problem.

If you could provide full context on how you’re trying to use RL to model your environment then I could probably help more.

Based on the built-in preprocessors, you might want to try MultiDiscrete([6, 5]). This flattens to an 11-element vector. When your env's step function receives the action, you grab the first element as the row and the second as the column:

def step(self, action):
    # action comes from MultiDiscrete([6, 5]) as a 2-element array
    self.state.row = action[0]     # row index, 0-5
    self.state.column = action[1]  # column index, 0-4
    ...

You can also use Tuple((Discrete(6), Discrete(5))) or Dict({'row': Discrete(6), 'col': Discrete(5)}), which will flatten to the same 11-element vector. It just depends on your preference, really. I like using Dicts since each "channel" of the action space is labeled with its key, which takes out the guesswork ("Wait, is row the first element or the second element?").
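With the Dict version, the step function reads the action by key; a minimal sketch (the key names simply match the space definition above):

def step(self, action):
    # action looks like {'row': 3, 'col': 2}
    self.state.row = action["row"]
    self.state.column = action["col"]
    ...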

With regards to the invalid cells, it sounds like you want to independently control 30 bits with only 11 bits of information (plus the knowledge that the first 6 mean row and the last 5 mean column). Is it possible to capture 2^30 possible independent states with only 11 bits of information (and some simple logic)? Let's make an 11-element mask and label the elements:

row 0 | row 1 | row 2 | row 3 | row 4 | row 5 | col 0 | col 1 | col 2 | col 3 | col 4
 0/1  |  0/1  |  0/1  |  0/1  |  0/1  |  0/1  |  0/1  |  0/1  |  0/1  |  0/1  |  0/1

This makes it clear that each element of the vector controls the on/off (valid/invalid) state of the entire row/column. Let’s consider three options for a logical operator combining two values: and, or, xor.

If you use and, this means that both row and column must be on in order for the cell to be on. So you can keep (3,4) off by setting row 4 to on and column 3 to off. However, this also makes everything in column 3 off, so that doesn’t work.

If you use or, this means that row or column must be on in order for the cell to be on. So you can keep (3,4) off by setting row 4 and column 3 to off. (3,3) can be turned on by setting row 3 to on. However, this means that (2,3) will be on, regardless of what column 2 says.

If you use xor, this means that row or column must be on in order for the cell to be on, and if both of them are on, then the cell turns off. You can turn off (3,4) by setting row 4 and column 3 to on. Then set column 2 to on, which will turn off (2,4). Turn on rows 1-5 to set (2, 1:5) to off. However, since column 3 is on in order to have (3,4) be off, that means that (3, 1:5) will also turn off. Alternatively, we can make (3,4) off by keeping column 3 and row 4 off. Turn on columns 0, 1, and 4 to make those columns green. Now we can’t make (3,3) green without turning on row 3, which would turn off (0,3), (1,3), and (4,3). So instead we can leave row 4 and column 3 off, then turn on rows 0-3 and 5. Then we can turn on column 2 to make (2,1), (2,2) and (2,5) turn off. However, this would also turn on (2,4) and turn off (2,0), which we don’t want…

Here I gave three examples: and, or, and xor. You can make more complicated logical statements to capture more combinations, but I believe you would end up having to write a logic statement that is essentially equivalent to the full 30 bits of information (one per cell).
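To make that counting argument concrete, here is a small sketch (assuming numpy is available) that enumerates every possible 11-element row/column mask and counts how many distinct 6x5 validity grids each operator can actually produce, compared with the 2^30 grids that exist:

import itertools
import numpy as np

ROWS, COLS = 6, 5

def distinct_grids(op):
    # Enumerate all 2^11 row/column masks and collect the distinct
    # 6x5 grids that the given operator can induce from them.
    grids = set()
    for bits in itertools.product([False, True], repeat=ROWS + COLS):
        row_mask = np.array(bits[:ROWS])
        col_mask = np.array(bits[ROWS:])
        grid = op(row_mask[:, None], col_mask[None, :])  # broadcast to a 6x5 grid
        grids.add(grid.tobytes())
    return len(grids)

for name, op in [("and", np.logical_and), ("or", np.logical_or), ("xor", np.logical_xor)]:
    print(f"{name}: {distinct_grids(op)} distinct grids out of {2 ** (ROWS * COLS)} possible")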


Thank you, I will try it.

An additional follow-up: restricting actions as part of the game mechanics is a little tricky, as we see here. One technique that I have seen often, and that I employ in most of my games, is to give the agent a small penalty if it attempts to move to an invalid cell. The action won't result in a state change, and over time the agent will learn to avoid moving onto invalid cells. You have to include the invalidity in the observation so that the agent knows what's nearby. I do this a lot and it works great for me. So if working out the action space exactly is too time-consuming, you might just want to handle it in the reward function.
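A minimal sketch of that idea, assuming a boolean valid_cells grid and hypothetical helper methods (_get_obs, _compute_reward, _check_done) that your env would define:

def step(self, action):
    row, col = action[0], action[1]
    if not self.valid_cells[row, col]:
        # Invalid cell: no state change, small penalty so the agent learns to avoid it
        return self._get_obs(), -0.1, False, {}
    self.position = (row, col)
    return self._get_obs(), self._compute_reward(), self._check_done(), {}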

@rusu24edward Both reward value modification and masking have advantages.
I use:

  1. masking as hard constraints in my problem
  2. reward value modification as soft constraints

Masking prevents the agent from choosing forbidden actions, while rewards help it select better solutions from the set of available actions.
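For the hard constraints, a common way to implement the mask (not specific to my setup) is to add a large negative number to the logits of forbidden actions inside a custom model, so they effectively get zero probability; a minimal sketch assuming the flat 30-action formulation and a 0/1 mask passed in through the observation:

import numpy as np

def apply_action_mask(logits, action_mask):
    # action_mask: 1 for allowed actions, 0 for forbidden ones.
    # Forbidden logits get a large negative offset, so softmax assigns them ~0 probability.
    return logits + np.where(action_mask == 1, 0.0, -1e9)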
