This “available actions” mechanism seems to be helpful when you have to handle a huge action space and/or a number of available actions that varies from step to step.
A small example of how I interpret it:
all_actions = {0, 1, 2, 3, 4, 5}, n=6 total number of actions, action_embedding_size=2
=> action embedding matrix E is 6x2
m=3 < n available actions at a specific timestep, e.g. avail_actions=(0, 2, 4)
=> action embedding for avail_actions is a matrix E* composed of 1st, 3rd and 5th row of E
Finally, you calculate the dot product of E* and the intent vector from the NN (which has length action_embedding_size), giving one score per available action
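The steps above can be sketched in a few lines of plain Python. This is only an illustration of my reading, not any library's actual implementation: the values in E and intent are made up, and the helper names select_rows and score_actions are hypothetical.

```python
# Illustrative values only: E and intent are made up, not from a trained model.
E = [
    [0.0, 1.0],    # embedding of action 0
    [1.0, 0.0],    # embedding of action 1
    [0.5, 0.5],    # embedding of action 2
    [1.0, 1.0],    # embedding of action 3
    [-1.0, 0.5],   # embedding of action 4
    [0.0, -1.0],   # embedding of action 5
]
avail_actions = (0, 2, 4)   # m=3 of the n=6 actions are available
intent = [2.0, 1.0]         # stand-in for the NN output, length action_embedding_size=2


def select_rows(matrix, indices):
    """Build E*: keep only the rows that belong to the available actions."""
    return [matrix[i] for i in indices]


def score_actions(e_star, intent_vec):
    """Dot product of every row of E* with the intent vector."""
    return [sum(w * x for w, x in zip(row, intent_vec)) for row in e_star]


E_star = select_rows(E, avail_actions)   # 3x2 matrix
logits = score_actions(E_star, intent)   # one score per available action

# Map the best score back to the original action id.
best_action = avail_actions[logits.index(max(logits))]
print(logits, best_action)
```

The scores only cover the m available actions, so the index of the best score has to be mapped back through avail_actions to get the real action id.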
Code example showing the embeddings:
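import numpy
import tensorflow as tf

# Embedding table E: 6 actions, each mapped to a 2-dimensional vector.
model = tf.keras.Sequential()
embed = tf.keras.layers.Embedding(6, 2)
model.add(embed)
model.summary()
print(embed.get_weights())

# Look up the embeddings of the available actions 0, 2 and 4.
# (Named avail_actions rather than input to avoid shadowing the builtin.)
avail_actions = numpy.asarray_chkfinite([0, 2, 4])
model.compile("rmsprop", "mse")
out = model.predict(avail_actions)
for i in range(out.shape[0]):
    print("{} {}".format(avail_actions[i], out[i]))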
Output:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 2)           12
=================================================================
Total params: 12
Trainable params: 12
Non-trainable params: 0
_________________________________________________________________
[array([[-0.04093417, -0.02362244],
       [-0.01528452, -0.02044444],
       [ 0.04733466,  0.01246139],
       [ 0.01975517,  0.02948004],
       [ 0.03812562,  0.0137356 ],
       [ 0.04121368, -0.01421856]], dtype=float32)]
2021-04-20 15:25:32.082275: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
0 [[-0.04093417 -0.02362244]]
2 [[0.04733466 0.01246139]]
4 [[0.03812562 0.0137356 ]]
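The three printed lookups are exactly rows 0, 2 and 4 of the printed weight matrix, i.e. the E* from the interpretation above. A small sanity check, with the numbers copied by hand from the output (the name printed_lookup is just for this sketch):

```python
# Weights copied from the printed embedding matrix above (6 actions x 2 dims).
E = [
    [-0.04093417, -0.02362244],
    [-0.01528452, -0.02044444],
    [0.04733466, 0.01246139],
    [0.01975517, 0.02948004],
    [0.03812562, 0.0137356],
    [0.04121368, -0.01421856],
]

# Lookup results copied from the last three printed lines, keyed by action id.
printed_lookup = {
    0: [-0.04093417, -0.02362244],
    2: [0.04733466, 0.01246139],
    4: [0.03812562, 0.0137356],
}

# Each lookup result must equal the row of E with the same index.
for action, vector in printed_lookup.items():
    assert E[action] == vector

# E* is therefore just E restricted to the available actions.
E_star = [E[action] for action in printed_lookup]
print(E_star)
```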