Hello,
Can RLlib also handle a variable-length set of available actions per step, as they did in OpenAI Five?
That is, using a learned embedding in the model (e.g. tf.keras.layers.Embedding) and thus a variable-length softmax over the final action logits?
I posted a small example of what I intend to do here.
Or must the available-actions embedding always have a fixed length (= “num max avail actions”) to work with RLlib? If so, can RLlib only work with a setup similar to the parametric actions cartpole example?
I guess you have already read the docs on variable-length actions? RLlib Models, Preprocessors, and Action Distributions — Ray v2.0.0.dev0
I haven’t used variable-length actions myself, but it does seem like they are supported. I am still trying to understand how they work, though.
Hi @stefanbschneider,
Yes, I have read it. In the example in the docs, the information about available actions comes from the env via an embedding matrix of size (num_max_avail_actions, embed_size). In my understanding, the NN model is supposed to learn an intent vector (of embed_size) which is multiplied with that matrix and finally produces a “similarity score” for each vector pair, ending up in an action logits vector of size num_max_avail_actions.
I’m not 100% sure of what I’m saying right now, but I believe that RLlib can only deal with action logit vectors of a fixed size (num_max_avail_actions == num_outputs). In that case, the idea of having variable-length actions rather corresponds to masking out some actions instead of actually having variable-length vectors of action logits.
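Here is a rough sketch of what I mean by masking with a fixed-size logit vector (plain TF for illustration, not the actual RLlib code):

```python
import tensorflow as tf

# Fixed-size masking idea: the model always emits num_max_avail_actions
# logits, and unavailable actions are pushed towards -inf before the softmax.
num_max_avail_actions = 6

logits = tf.random.normal([1, num_max_avail_actions])  # [B, A] raw model output
avail_mask = tf.constant([[1., 0., 1., 0., 1., 0.]])   # [B, A], 1 = available

# log(0) = -inf; clip to the float32 minimum as in the cartpole example.
inf_mask = tf.maximum(tf.math.log(avail_mask), tf.float32.min)
probs = tf.nn.softmax(logits + inf_mask)  # still length A, invalid actions get ~0 probability
```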
Again, I’m not 100% sure, but I believe it works as follows in OpenAI Five (or at least this is what I want to do):
Let’s say the env tells us that the available actions are {0, 2, 4} out of {0, 1, 2, 3, 4, 5}. We then select rows 0, 2, and 4 of an embedding matrix and calculate the dot product between the intent vector (of embed_size) and each of the three “embedding vectors” (also of embed_size). The result is a vector of size three, which gives us the scores, i.e. the action logits, of our three currently available actions. But I believe this means a variable-length softmax all the time.
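In code, what I have in mind looks roughly like this (a sketch with made-up shapes, not working RLlib code):

```python
import tensorflow as tf

# Gather only the embeddings of the currently available actions and
# softmax over however many there are this step.
embed_size = 4
num_all_actions = 6

action_embed = tf.keras.layers.Embedding(num_all_actions, embed_size)  # learned

avail_ids = tf.constant([0, 2, 4])               # variable length per step
intent = tf.random.normal([embed_size])          # intent vector from the model

avail_embeds = action_embed(avail_ids)           # [num_avail, embed_size]
logits = tf.linalg.matvec(avail_embeds, intent)  # [num_avail] dot products
probs = tf.nn.softmax(logits)                    # variable-length softmax
```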
Hey @klausk55, what you are saying is all correct. We currently do not support flexible max_num_action values; this is always a fixed number given by the environment. Btw, if you are working out something new (even if a little hacky), feel free to push a PR. We always welcome example scripts that demonstrate that something can be done with RLlib, even if not officially supported.
Our parametric_actions_cartpole example works as follows:
ray/rllib/examples/parametric_actions_cartpole.py (also check the model therein and match it to the logic below).
- B=batch size
- Assume: 3 actions (A=3); embedding size = e = 4
- Env provides (binary) mask of size A, e.g. [0 1 1]
- Env provides embedding matrix M (size=[B, A, e]), e.g. [[0, 0, 0, 0], [0.1, 0.2, 0.3, 0.4], [-0.1, -0.2, -0.3, -0.4]]
- Our model outputs a single(!) intent vector V (size=[B, 1, e]), e.g. [0.5, 0.4, 0.3, 0.2]
- This intent vector is multiplied (with dim=1 broadcast from 1 → A) with the embedding matrix from the env:
V (broadcast) * M (← Hadamard product!) = [B, A, e]
- Now, we do something that I don’t understand 100% either: we reduce-sum over the last axis to get [B, A], which is interpreted as the discrete action logits and sampled from to yield the actual discrete action sent to the env. In another paper (https://arxiv.org/pdf/1902.00183.pdf) I found that one can actually also learn this mapping (from embedding to action) via supervised learning.
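In code, the steps above look roughly like this (a numeric sketch, not the exact example script):

```python
import tensorflow as tf

# B=2 here just for illustration.
B, A, e = 2, 3, 4

mask = tf.constant([[0., 1., 1.],
                    [1., 1., 0.]])      # [B, A] availability mask from the env
M = tf.random.normal([B, A, e])         # [B, A, e] action embeddings from the env
V = tf.random.normal([B, 1, e])         # [B, 1, e] intent vector from the model

# Broadcast V over dim 1, Hadamard product, then reduce-sum over e:
# effectively a batched dot product between V and each row of M.
logits = tf.reduce_sum(V * M, axis=-1)  # [B, A]

# Mask out unavailable actions and sample the discrete action.
inf_mask = tf.maximum(tf.math.log(mask), tf.float32.min)
action = tf.random.categorical(logits + inf_mask, num_samples=1)  # [B, 1]
```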
Either way, I think you are on the right track.
Sorry @sven1977, but I have no idea how a flexible/variable-length number of action logits could be supported instead of a fixed number. To me, this means that the calculations in ActionDistribution would have to support flexible/variable-length action distributions. According to statements in this blog post, OpenAI Five uses embeddings and a variable-length softmax to get its action distribution; anyhow, it works.
Nevertheless, I have experimented with a modified version of the parametric actions cartpole example, where the embedding is part of the model and is learned. Still, the env returns a mask of the currently available actions.
So far, my version of the parametric actions cartpole example seems to work, though it is hacky for sure.
I have pushed a PR.
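The core idea of my version looks roughly like this (placeholder names and sizes, not the exact code from the PR):

```python
import tensorflow as tf

# The embedding table is a trainable layer inside the model; the env
# only returns the availability mask.
num_max_avail_actions, embed_size, obs_size = 6, 4, 8

action_embed = tf.keras.layers.Embedding(num_max_avail_actions, embed_size)
intent_layer = tf.keras.layers.Dense(embed_size)

obs = tf.random.normal([1, obs_size])            # [B, obs_size] from the env
mask = tf.constant([[1., 0., 1., 0., 1., 0.]])   # [B, A] from the env

intent = intent_layer(obs)                              # [B, embed_size]
embeds = action_embed(tf.range(num_max_avail_actions))  # [A, embed_size], learned
logits = tf.matmul(intent, embeds, transpose_b=True)    # [B, A]

# Unavailable actions are still masked out, as in the original example.
masked_logits = logits + tf.maximum(tf.math.log(mask), tf.float32.min)
```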
Awesome, thanks a lot for the PR @klausk55 !