Hello,
Can RLlib also handle a variable-length set of available actions per step, as they did in OpenAI Five?
That is, using a learned embedding in the model (e.g. tf.keras.layers.Embedding) and thus a variable-length softmax over the final action logits?
I posted a small example of what I intend to do here.
Or must the available-actions embedding always have a fixed length (= “num max avail actions”) to work with RLlib? If so, can RLlib only work with a setup similar to the parametric actions cartpole example?
I guess you have already read the docs on variable-length actions? RLlib Models, Preprocessors, and Action Distributions — Ray v2.0.0.dev0
I haven’t used variable-length actions myself, but it does seem like they are supported. I am still trying to understand how they work, though.
Hi @stefanbschneider,
Yes, I have read it. In the example in the docs, the information about available actions comes from the env via an embedding matrix of size (num_max_avail_actions, embed_size). In my understanding, the NN model is supposed to learn an intent vector (of embed_size) which is multiplied with that matrix and finally produces a “similarity score” for each vector pair, ending up in an action logits vector of size num_max_avail_actions.
I’m not 100% sure of what I’m saying right now, but I believe that RLlib can only deal with action logit vectors of a fixed size (num_max_avail_actions == num_outputs). In that case, the idea of having variable-length actions rather corresponds to masking out some actions instead of actually having variable-length vectors of action logits.
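Here is a rough sketch of what I mean by masking with a fixed-size logit vector (plain TF for illustration, not the actual RLlib code):

```python
import tensorflow as tf

# Fixed-size masking idea: the model always emits num_max_avail_actions
# logits, and unavailable actions are pushed towards -inf before the softmax.
num_max_avail_actions = 6

logits = tf.random.normal([1, num_max_avail_actions])  # [B, A] raw model output
avail_mask = tf.constant([[1., 0., 1., 0., 1., 0.]])   # [B, A], 1 = available

# log(0) = -inf; clip to the float32 minimum as in the cartpole example.
inf_mask = tf.maximum(tf.math.log(avail_mask), tf.float32.min)
probs = tf.nn.softmax(logits + inf_mask)  # still length A, invalid actions get ~0 probability
```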
Again, I’m not 100% sure, but I believe it works as follows in OpenAI Five (or at least this is what I want to do):
Let’s say the env tells us that the available actions are {0, 2, 4} out of {0, 1, 2, 3, 4, 5}. We then select rows 0, 2, and 4 of an embedding matrix and calculate the dot product between the intent vector (of embed_size) and each of the three “embedding vectors” (also of embed_size). The result is a vector of size three, which gives us the scores, i.e. the action logits, of our three currently available actions. But I believe this means a variable-length softmax all the time.
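In code, what I have in mind looks roughly like this (a sketch with made-up shapes, not working RLlib code):

```python
import tensorflow as tf

# Gather only the embeddings of the currently available actions and
# softmax over however many there are this step.
embed_size = 4
num_all_actions = 6

action_embed = tf.keras.layers.Embedding(num_all_actions, embed_size)  # learned

avail_ids = tf.constant([0, 2, 4])               # variable length per step
intent = tf.random.normal([embed_size])          # intent vector from the model

avail_embeds = action_embed(avail_ids)           # [num_avail, embed_size]
logits = tf.linalg.matvec(avail_embeds, intent)  # [num_avail] dot products
probs = tf.nn.softmax(logits)                    # variable-length softmax
```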
Hey @klausk55, what you are saying is all correct. We currently do not support flexible max_num_action values; this is always a fixed number given by the environment. Btw, if you are working out something new (even if a little hacky), feel free to push a PR. We always welcome example scripts that demonstrate that something can be done with RLlib, even if not officially supported.
Our parametric_actions_cartpole example works as follows:
ray/rllib/examples/parametric_actions_cartpole.py (also check the model therein and match it to the logic below).
- B=batch size
- Assume: 3 actions (A=3); embedding size = e = 4
- Env provides (binary) mask of size A, e.g. [0 1 1]
- Env provides embedding matrix M (size=[B, A, e]), e.g. [[0, 0, 0, 0], [0.1, 0.2, 0.3, 0.4], [-0.1, -0.2, -0.3, -0.4]]
- Our model outputs a single(!) intent vector V (size=[B, 1, e]), e.g. [0.5, 0.4, 0.3, 0.2]
- This intent vector is multiplied (with dim=1 broadcast from 1 → A) with the embedding matrix from the env:
V (broadcast) * M (← Hadamard product!) = [B, A, e]
- Now, we do something that I don’t understand 100% either: we reduce-sum over the last axis to get [B, A], which is interpreted as the discrete action logits and sampled from to yield the actual discrete action sent to the env. In another paper (https://arxiv.org/pdf/1902.00183.pdf) I found that one can actually also learn this mapping (from embedding to action) via supervised learning.
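In code, the steps above look roughly like this (a numeric sketch, not the exact example script):

```python
import tensorflow as tf

# B=2 here just for illustration.
B, A, e = 2, 3, 4

mask = tf.constant([[0., 1., 1.],
                    [1., 1., 0.]])      # [B, A] availability mask from the env
M = tf.random.normal([B, A, e])         # [B, A, e] action embeddings from the env
V = tf.random.normal([B, 1, e])         # [B, 1, e] intent vector from the model

# Broadcast V over dim 1, Hadamard product, then reduce-sum over e:
# effectively a batched dot product between V and each row of M.
logits = tf.reduce_sum(V * M, axis=-1)  # [B, A]

# Mask out unavailable actions and sample the discrete action.
inf_mask = tf.maximum(tf.math.log(mask), tf.float32.min)
action = tf.random.categorical(logits + inf_mask, num_samples=1)  # [B, 1]
```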
Either way, I think you are on the right track.
Sorry @sven1977, but I have no idea how a flexible/variable-length number of action logits could be supported instead of a fixed number. To me, this means that the calculations in ActionDistribution would have to support flexible/variable-length action distributions. According to statements in this blog post, OpenAI Five uses embeddings and a variable-length softmax to get its action distribution; anyhow, it works.
Nevertheless, I have experimented with a modified version of the parametric actions cartpole example, where the embedding is part of the model and is learned. Still, the env returns a mask of the currently available actions.
So far, my version of the parametric actions cartpole example seems to work, though it is hacky for sure.
I have pushed a PR.
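The core idea of my version looks roughly like this (placeholder names and sizes, not the exact code from the PR):

```python
import tensorflow as tf

# The embedding table is a trainable layer inside the model; the env
# only returns the availability mask.
num_max_avail_actions, embed_size, obs_size = 6, 4, 8

action_embed = tf.keras.layers.Embedding(num_max_avail_actions, embed_size)
intent_layer = tf.keras.layers.Dense(embed_size)

obs = tf.random.normal([1, obs_size])            # [B, obs_size] from the env
mask = tf.constant([[1., 0., 1., 0., 1., 0.]])   # [B, A] from the env

intent = intent_layer(obs)                              # [B, embed_size]
embeds = action_embed(tf.range(num_max_avail_actions))  # [A, embed_size], learned
logits = tf.matmul(intent, embeds, transpose_b=True)    # [B, A]

# Unavailable actions are still masked out, as in the original example.
masked_logits = logits + tf.maximum(tf.math.log(mask), tf.float32.min)
```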
Awesome, thanks a lot for the PR @klausk55 !