Based on the built-in preprocessors, you might want to try MultiDiscrete([6, 5]). This flattens to an 11-element vector. When your env’s step function receives the action, you grab the first element as the row and the second as the column.
def step(action):
    # action is a pair of integers: [row, column]
    state.row = action[0]
    state.column = action[1]
    ...
You can also use Tuple([Discrete(6), Discrete(5)]) or Dict({'row': Discrete(6), 'col': Discrete(5)}), which will flatten to the same 11-element vector. It just depends on your preference. I like using Dicts since each “channel” of the action space is labeled with its key, which takes out the guesswork (“Wait, is row the first element or the second element?”).
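To see where the 11 comes from, here is a minimal pure-Python sketch of the flattening, assuming each discrete component is one-hot encoded and the pieces are concatenated (6 + 5 = 11). The helper names (one_hot, flatten_action, unflatten_action) are made up for illustration and are not RLlib functions.

```python
N_ROWS, N_COLS = 6, 5


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    vec = [0] * size
    vec[index] = 1
    return vec


def flatten_action(row, col):
    """Concatenate the row one-hot (6 entries) and the column one-hot (5 entries)."""
    return one_hot(row, N_ROWS) + one_hot(col, N_COLS)


def unflatten_action(vec):
    """Recover the labeled action from the 11-element vector."""
    return {"row": vec[:N_ROWS].index(1), "col": vec[N_ROWS:].index(1)}


flat = flatten_action(row=2, col=4)
print(flat)                    # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
print(unflatten_action(flat))  # {'row': 2, 'col': 4}
```

The first 6 entries carry the row and the last 5 carry the column, which is the same layout the mask discussion below relies on.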
With regard to the invalid cells, it sounds like you want to independently control 30 bits with only 11 bits of information (and the knowledge that the first 6 mean row and the last 5 mean column). Is it possible to capture 2^30 possible independent states with only 11 bits of information (and some simple logic)? Let’s make an 11-element mask and label them:
row 0 | row 1 | row 2 | row 3 | row 4 | row 5 | col 0 | col 1 | col 2 | col 3 | col 4 |
0/1   | 0/1   | 0/1   | 0/1   | 0/1   | 0/1   | 0/1   | 0/1   | 0/1   | 0/1   | 0/1   |
This makes it clear that each element of the vector controls the on/off (valid/invalid) state of the entire row/column. Let’s consider three options for a logical operator combining two values: and, or, xor.
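As a rough sketch of what "combining" means here, the snippet below expands an 11-element mask into a 6x5 grid of cell states using a chosen operator. The function name expand_mask and the grid[row][col] layout are my own choices for illustration; note that the cells in this post are written as (column, row).

```python
import operator

N_ROWS, N_COLS = 6, 5


def expand_mask(mask, combine):
    """Expand an 11-element 0/1 mask into a 6x5 grid of 0/1 cell states.

    The first 6 entries are the row bits, the last 5 are the column bits.
    grid[row][col] is combine(row_bit, col_bit).
    """
    if len(mask) != N_ROWS + N_COLS:
        raise ValueError("mask must have 11 elements")
    row_bits, col_bits = mask[:N_ROWS], mask[N_ROWS:]
    return [[combine(r, c) for c in col_bits] for r in row_bits]


# All rows on, all columns on except column 3, combined with `and`.
mask = [1, 1, 1, 1, 1, 1] + [1, 1, 1, 0, 1]
grid = expand_mask(mask, operator.and_)
for row in grid:
    print(row)  # every row prints [1, 1, 1, 0, 1]
```

Swapping operator.and_ for operator.or_ or operator.xor lets you try the other two options with the same mask.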
If you use and, this means that both row and column must be on in order for the cell to be on. So you can keep (3,4) off by setting row 4 to on and column 3 to off. However, this also makes everything in column 3 off, so that doesn’t work.
If you use or, this means that row or column must be on in order for the cell to be on. So you can keep (3,4) off by setting row 4 and column 3 to off. (3,3) can be turned on by setting row 3 to on. However, this means that (2,3) will be on, regardless of what column 2 says.
If you use xor, this means that row or column must be on in order for the cell to be on, and if both of them are on, then the cell turns off. You can turn off (3,4) by setting row 4 and column 3 to on. Then set column 2 to on, which will turn off (2,4). Turn on rows 1-5 to set (2, 1:5) to off. However, since column 3 is on in order to have (3,4) be off, that means that (3, 1:5) will also turn off. Alternatively, we can make (3,4) off by keeping column 3 and row 4 off. Turn on columns 0, 1, and 4 to make those columns green. Now we can’t make (3,3) green without turning on row 3, which would turn off (0,3), (1,3), and (4,3). So instead we can leave row 4 and column 3 off, then turn on rows 0-3 and 5. Then we can turn on column 2 to make (2,1), (2,2) and (2,5) turn off. However, this would also turn on (2,4) and turn off (2,0), which we don’t want…
Here I gave three examples: and, or, and xor. You can make more complicated logical statements to capture more combinations, but I believe you would end up having to make a logic statement that is essentially equivalent to the full 30 bits of information (2^30 possible masks).
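If you want to check the counting argument yourself, here is a small brute-force sketch. It enumerates all 2^11 row/column masks for each operator and collects the distinct 6x5 grids they can produce, which can never exceed 2^11 = 2048 out of 2^30 possible grids. The target grid is a hypothetical example I picked (two cells off, in different rows and columns), not the one from your picture.

```python
from itertools import product
import operator

N_ROWS, N_COLS = 6, 5


def reachable_grids(combine):
    """Collect every distinct 6x5 grid that some 11-bit row/column mask can produce."""
    grids = set()
    for bits in product((0, 1), repeat=N_ROWS + N_COLS):
        row_bits, col_bits = bits[:N_ROWS], bits[N_ROWS:]
        # Flatten the grid row by row so it can be stored in a set.
        grids.add(tuple(combine(r, c) for r in row_bits for c in col_bits))
    return grids


OPERATORS = {"and": operator.and_, "or": operator.or_, "xor": operator.xor}
reachable = {name: reachable_grids(op) for name, op in OPERATORS.items()}

# Hypothetical target: every cell on except two cells in different rows and columns.
target = [1] * (N_ROWS * N_COLS)
target[0 * N_COLS + 0] = 0  # row 0, column 0
target[1 * N_COLS + 1] = 0  # row 1, column 1
target = tuple(target)

total = 2 ** (N_ROWS * N_COLS)
for name, grids in reachable.items():
    print(f"{name}: {len(grids)} of {total} grids; target reachable: {target in grids}")
```

None of the three operators can reach that target, and each one covers only a tiny fraction of the 2^30 grids, which is the point of the argument above.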