Available actions with variable-length action embeddings


Can RLlib also handle a variable-length set of available actions per step, as used in OpenAI Five?
That is, using a learned embedding in the model (e.g. tf.keras.layers.Embedding) and thus a variable-length softmax for the final action logits?
I posted a small example of what I intend to do here.

Or must the available actions embedding always have a constant and fixed length (=“num max avail actions”) to work with RLlib? If so, can RLlib only work with a setup similar to the parametric actions cartpole example?


I guess you have read the docs corresponding to variable-length actions? RLlib Models, Preprocessors, and Action Distributions — Ray v2.0.0.dev0
I haven’t used variable-length actions myself, but it does seem like having variable-length actions is supported. I am still trying to understand how they work though.


Hi @stefanbschneider,
Yes, I have read it. In the example in the docs, the information about available actions comes from the env as an embedding matrix of size (num_max_avail_actions, embed_size). In my understanding, the NN model should learn an intent vector (of embed_size) which is multiplied with this matrix, producing a "similarity score" for each pair of vectors and ending up in an action logits vector of size num_max_avail_actions.
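To make that concrete, here is a minimal NumPy sketch of the dot-product scoring described above (random values stand in for the env's embedding matrix and the model's intent vector):

```python
import numpy as np

num_max_avail_actions, embed_size = 6, 4

# Embedding matrix provided by the env: one row per action.
action_embeddings = np.random.randn(num_max_avail_actions, embed_size)

# Intent vector produced by the model.
intent = np.random.randn(embed_size)

# Dot product of the intent vector with every embedding row yields one
# similarity score per action, i.e. a fixed-size action logits vector.
logits = action_embeddings @ intent  # shape: (num_max_avail_actions,)
```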

I'm not 100% sure about this, but I believe that RLlib can only handle action logits vectors of a fixed size (num_max_avail_actions == num_outputs). Therefore, I believe that in this case the idea of "variable-length actions" rather corresponds to masking out some actions instead of actually having variable-length vectors of action logits.
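For reference, here is a minimal sketch of what such masking typically looks like: a large negative constant is added to the logits of unavailable actions before the softmax, so their probabilities become (near-)zero while the logits vector keeps its fixed size (the numbers are made up for illustration):

```python
import numpy as np

# Fixed-size logits over all 6 actions, and a validity mask from the env.
logits = np.array([1.2, -0.3, 0.7, 0.1, -1.0, 0.4])
mask = np.array([1, 0, 1, 0, 1, 0], dtype=bool)  # actions 0, 2, 4 available

# Push invalid logits to a huge negative value so the softmax ignores them.
masked_logits = np.where(mask, logits, -1e9)

# Numerically stable softmax over the still fixed-size logits vector.
probs = np.exp(masked_logits - masked_logits.max())
probs /= probs.sum()
```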

Again, I’m not 100% sure but I believe in OpenAI Five it should work as follows (or at least this is what I want to do :sweat_smile:):
Let's say the env tells us that the available actions are {0, 2, 4} out of {0, 1, 2, 3, 4, 5}. We then select rows 0, 2, 4 of an embedding matrix and calculate the dot product between the intent vector (of embed_size) and each of the three "embedding vectors" (also of embed_size). The result is a vector of size three which gives us the scores, resp. action logits, of our three currently available actions. But I believe this implies a variable-length softmax all the time :man_shrugging:
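In a minimal NumPy sketch (random values stand in for the learned embedding matrix and intent vector), that idea looks like this:

```python
import numpy as np

embed_size = 4
embedding_matrix = np.random.randn(6, embed_size)  # one row per action
avail = [0, 2, 4]  # available actions reported by the env this step

# Gather only the rows of the available actions -> variable-length.
avail_embeddings = embedding_matrix[avail]   # shape: (3, embed_size)

intent = np.random.randn(embed_size)         # intent vector from the model
logits = avail_embeddings @ intent           # shape: (3,) -- variable-length!

# Softmax over only the available actions (the "variable-length softmax").
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```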

Hey @klausk55 , what you are saying is all correct. We currently do not support flexible max_num_action values. This is always a fixed number given by the environment. Btw, if you are working out something new (even if a little hacky), feel free to push a PR. We always welcome example scripts that demonstrate that something can be done with RLlib, even if not officially supported.

Our parametric_actions_cartpole example works as follows:
ray/rllib/examples/parametric_actions_cartpole.py (also check the model therein and match to the below logic).

  • B=batch size
  • Assume: 3 actions (A=3); embedding size = e = 4
  • Env provides (binary) mask of size A, e.g. [0 1 1]
  • Env provides embedding matrix M (size=[B, A, e]), e.g. (one batch element shown) [[0 0 0 0], [0.1, 0.2, 0.3, 0.4], [-0.1, -0.2, -0.3, -0.4]]
  • Our model outputs a single(!) intent vector V (size=[B, 1, e]), e.g. [0.5, 0.4, 0.3, 0.2]
  • This intent vector is multiplied (and dim=1 is broadcast from 1 → A) with the embedding matrix from the env:
    V (broadcast) * (<- Hadamard product!) M = [B, A, e]
  • Now, we do something that I don’t understand 100% either: We reduce-sum over the last axis to get [B, A], which are interpreted as discrete action logits and sampled from to yield the actual discrete action to send to the env. In another paper (https://arxiv.org/pdf/1902.00183.pdf) I found that one can actually also learn this mapping (from embedding to action) via supervised learning.
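The steps above can be sketched in NumPy (random values stand in for the env's embedding matrix and the model's intent vector; the broadcast-multiply-then-reduce-sum is just a batched dot product):

```python
import numpy as np

B, A, e = 2, 3, 4  # batch size, num actions, embedding size

M = np.random.randn(B, A, e)  # embedding matrix from the env
V = np.random.randn(B, 1, e)  # single intent vector from the model

# Broadcast V over dim 1 (1 -> A), take the elementwise (Hadamard) product,
# then reduce-sum over the last (embedding) axis -> discrete action logits.
logits = (V * M).sum(axis=-1)  # shape: (B, A)

# Equivalent view: logits[b, a] = dot(V[b, 0], M[b, a]), i.e. a batched
# dot product between the intent vector and each action embedding.
ref = np.einsum('bae,be->ba', M, V[:, 0, :])
```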

Either way, I think you are on the right track. :slight_smile:


Sorry @sven1977, but I have no idea how a flexible/variable-length number of action logits could be supported instead of a fixed number. To me, this means that the calculations in ActionDistribution would have to support flexible/variable-length action distributions. According to statements in this blog post, OpenAI Five uses embeddings and a variable-length softmax to get its action distribution; anyhow, it works :see_no_evil:

Nevertheless, I have experimented with a new, modified version of the parametric actions cartpole example, where the embedding is part of the model and should be learned. Still, the env returns a mask of valid available actions.
So far, my version of the parametric actions cartpole example seems to work, though it is hacky for sure.
I have pushed a PR.


Awesome, thanks a lot for the PR @klausk55 !