Hi @stefanbschneider,

Yes, I have read it. In the example in the docs, the information about available actions comes from the env as an embedding matrix of size (*num_max_avail_actions, embed_size*). In my understanding, the NN model is supposed to learn an intent vector (of size *embed_size*) which is multiplied with that matrix, yielding a “similarity score” for each intent/embedding pair and thus an action-logits vector of size *num_max_avail_actions*.
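Just to make sure we mean the same thing, here is a minimal numpy sketch of that step (the sizes and random values are made up, not from the docs example):

```python
import numpy as np

rng = np.random.default_rng(0)

num_max_avail_actions = 6  # hypothetical
embed_size = 4             # hypothetical

# Embedding matrix provided by the env: one row per (possible) action.
action_embeddings = rng.normal(size=(num_max_avail_actions, embed_size))

# Intent vector produced by the NN model for the current observation.
intent = rng.normal(size=(embed_size,))

# Dot product of the intent with every action embedding gives one
# similarity score (logit) per action.
action_logits = action_embeddings @ intent

print(action_logits.shape)  # (6,)
```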

I’m not 100% sure of what I’m saying right now, but I believe that RLlib can only handle action-logit vectors of a fixed size (*num_max_avail_actions == num_outputs*). Therefore, I believe that in this case the idea of variable-length actions really corresponds to masking out some actions, rather than to actually having variable-length vectors of action logits.
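In other words, the masking approach I have in mind keeps the logit vector at its fixed size and just pushes unavailable actions to (numerically) zero probability, something like this (values are made up for illustration):

```python
import numpy as np

# Fixed-size logit vector for all 6 possible actions (made-up values).
logits = np.array([0.5, 1.2, -0.3, 0.8, 2.0, -1.1])

# 1 = available, 0 = unavailable; e.g. only actions {0, 2, 4} are valid.
action_mask = np.array([1, 0, 1, 0, 1, 0], dtype=float)

# Add a huge negative number to masked-out logits so their softmax
# probability becomes effectively zero -- the vector keeps its fixed size.
masked_logits = logits + (1.0 - action_mask) * -1e9

# Numerically stable softmax over the (still fixed-size) masked logits.
probs = np.exp(masked_logits - masked_logits.max())
probs /= probs.sum()

print(probs)  # unavailable actions get probability ~0
```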

Again, I’m not 100% sure, but I believe that in OpenAI Five it works as follows (or at least this is what I want to do):

Let’s say the env tells us that the available actions are {0, 2, 4} out of {0, 1, 2, 3, 4, 5}. We then select rows 0, 2, and 4 of the embedding matrix and calculate the dot product between the intent vector (of size *embed_size*) and each of the three “embedding vectors” (also of size *embed_size*). The result is a vector of size three which gives us the scores, i.e. the action logits, of our three currently available actions. But I believe this means computing a variable-length softmax all the time.
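A rough sketch of that variant (again with made-up sizes and values, and with no claim that this is exactly what OpenAI Five does):

```python
import numpy as np

rng = np.random.default_rng(1)
embed_size = 4  # hypothetical

# Full embedding matrix for all 6 possible actions.
all_embeddings = rng.normal(size=(6, embed_size))
intent = rng.normal(size=(embed_size,))

# Only actions {0, 2, 4} are available in this step.
avail = [0, 2, 4]
avail_embeddings = all_embeddings[avail]  # shape (3, embed_size)

# One logit per currently available action -> variable-length vector.
logits = avail_embeddings @ intent        # shape (3,)

# Softmax over exactly the available actions (variable length).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Map the local index back to the original action id.
chosen_action = avail[int(np.argmax(probs))]
```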