[RLlib] Impossible actions

Hi there, I’d like to make certain actions impossible in a multienv. I’ve read the doc (https://docs.ray.io/en/master/rllib-models.html#variable-length-parametric-action-spaces) but I don’t understand the purpose of avail_actions and action_mask. Could someone explain it to me, please?

I’m in a similar situation. Disclaimer: I know very little about RL; this is just what I’ve pieced together over a few hours of googling.

avail_actions seems to be there for action embeddings. If you follow links in the docs enough, you’ll get to ParametricActionsCartPole. action_mask is what we really want. Unfortunately, this example interweaves it with action embedding.

I would imagine you could delete self.action_assignments and its friends to get to base, mask-only functionality. You’d also need to modify ParametricActionsModel, since it expects avail_actions in observations and uses it to compute intent_vector, and thus action_logits.
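On the environment side, a mask-only setup presumably comes down to returning a dict observation that carries just the mask and the real observation, with no avail_actions entry. Below is a toy sketch under stated assumptions: the 1-D corridor, the `make_observation` helper, and the key names "action_mask" and "obs" are all hypothetical choices for illustration, not RLlib API. In an actual RLlib environment the observation space would also have to be declared as a matching Dict space.

```python
import numpy as np

N_CELLS = 5                          # hypothetical corridor with 5 cells
ACTIONS = ("left", "stay", "right")  # action ids 0, 1, 2

def make_observation(position):
    """Build a dict observation holding the real observation and an action mask."""
    action_mask = np.ones(len(ACTIONS), dtype=np.float32)
    if position == 0:
        action_mask[0] = 0.0         # cannot move left from the leftmost cell
    if position == N_CELLS - 1:
        action_mask[2] = 0.0         # cannot move right from the rightmost cell
    one_hot = np.zeros(N_CELLS, dtype=np.float32)
    one_hot[position] = 1.0          # the "real" observation: agent position
    return {"action_mask": action_mask, "obs": one_hot}

obs_left = make_observation(0)
obs_mid = make_observation(2)
print(obs_left["action_mask"], obs_mid["action_mask"])
```

The model then only needs to read the mask and the real observation from the dict; nothing about embeddings is involved.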

The theory here seems simple: to mask, just intercept forward calls and make the logits for masked/invalid actions very negative. I’m not sure why I can’t crack it; probably a silly dimensions issue.
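The masking idea itself can be sketched outside RLlib in a few lines of NumPy. This is only an illustration under assumptions: `softmax` and `mask_logits` are hypothetical helpers, not RLlib functions, and the large negative constant is an arbitrary choice.

```python
import numpy as np

def softmax(x):
    # Subtract the row-wise max for numerical stability.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mask_logits(logits, action_mask, neg=-1e9):
    # Keep logits where the mask is 1; replace them with a very negative value where it is 0.
    return np.where(action_mask.astype(bool), logits, neg)

# Batch of 2 observations, 4 discrete actions each.
logits = np.array([[0.5, 1.0, -0.2, 0.3],
                   [2.0, 0.1, 0.4, -1.0]])
action_mask = np.array([[1, 0, 1, 1],
                        [0, 0, 1, 1]])

probs = softmax(mask_logits(logits, action_mask))
print(probs.round(3))
```

Note that the logits and the mask must share the same shape, (batch, num_actions); a mismatch there is one plausible source of a dimensions issue. After the softmax, the masked actions end up with essentially zero probability, so they are never sampled.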

There’s a good blog post on this, but it only has one line on avail_actions:

The available actions correspond to each of the five items the agent can select for packing.

The author seems to work around avail_actions rather than excising it; they always set it to ones. Maybe that’s the easier approach.

If any maintainers read this, I’d love to see an example with action masking and embedding separated. I’m sure it’s painfully obvious to experts how to separate them.

Thank you very much for your answer. I’ve read through the article, but unfortunately it doesn’t explain the tricky parts at all. I still have no idea what action embedding is. I managed to mask out impossible actions by using action_mask like this:

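    # log(1) = 0 for valid actions; log(0) = -inf for invalid ones, clamped to FLOAT_MIN
    inf_mask = torch.clamp(torch.log(action_mask), FLOAT_MIN, FLOAT_MAX)
    # Adding the mask leaves valid logits unchanged and makes invalid ones hugely negative
    return output + inf_mask, []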

(This is in an actor-critic network; output holds the policy logits.)
But I wonder whether I’m missing something important that is needed to make everything work with avail_actions and action embeddings.
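To see what the log-and-clamp trick produces, here is a NumPy analogue of the two torch lines above. It is a sketch under assumptions: the FLOAT_MIN and FLOAT_MAX values below are stand-ins near the float32 range limits, chosen for illustration rather than copied from RLlib, and the arrays are made-up data.

```python
import numpy as np

# Assumed stand-in values, close to the float32 range limits.
FLOAT_MIN = -3.4e38
FLOAT_MAX = 3.4e38

action_mask = np.array([[1.0, 0.0, 1.0]], dtype=np.float32)  # shape (batch, num_actions)
output = np.array([[0.2, 1.5, -0.7]], dtype=np.float32)      # logits, same shape

with np.errstate(divide="ignore"):   # NumPy warns on log(0); silence it here
    log_mask = np.log(action_mask)   # 0 for valid actions, -inf for invalid ones

# Clamping turns -inf into a finite but hugely negative number.
inf_mask = np.clip(log_mask, FLOAT_MIN, FLOAT_MAX)

masked = output + inf_mask
print(masked)
```

Valid logits pass through unchanged because their mask term is exactly 0, while invalid ones are pushed to roughly FLOAT_MIN. Keeping the values finite presumably avoids inf or NaN arithmetic later on, for example when a loss subtracts two masked logits.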

Yeah, I sympathize. I still don’t quite grok, but I did find this post a bit enlightening: https://neuro.cs.ut.ee/the-use-of-embeddings-in-openai-five/

I love your article!! For a long time I’ve wanted to see attention mechanisms applied to reinforcement learning, and also to see reinforcement learning applied to a complex, variable observation space as well as a complex, variable action space.

This “available actions” thing seems to be helpful when you have to handle huge action spaces and/or a number of available actions that varies between steps.
A small example of how I interpret it:
all_actions = {0, 1, 2, 3, 4, 5}, n = 6 total actions, action_embedding_size = 2
=> the action embedding matrix E is 6x2
m = 3 < n actions are available at a specific timestep, e.g. avail_actions = (0, 2, 4)
=> the action embedding for avail_actions is a matrix E* composed of the 1st, 3rd and 5th rows of E
Finally, you calculate the dot product of the intent vector from the NN (of length action_embedding_size) with each row of E*, which gives one logit per available action.
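The arithmetic of this interpretation can be sketched in NumPy. The numbers are random stand-ins (in a real model E would be learned and the intent vector would come from the network), and the variable names are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions, embed_size = 6, 2
E = rng.normal(size=(n_actions, embed_size))  # action embedding matrix, 6x2

avail_actions = [0, 2, 4]                     # m = 3 actions available at this step
E_star = E[avail_actions]                     # 1st, 3rd and 5th rows of E -> 3x2

intent = rng.normal(size=embed_size)          # stand-in for the network's intent vector
logits = E_star @ intent                      # one score per available action, shape (3,)

# Map the best-scoring row of E* back to the original action id.
best = avail_actions[int(np.argmax(logits))]
print(logits, best)
```

The network output stays at size action_embedding_size no matter how many actions exist, which is presumably why this helps with huge or varying action sets.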

Code example showing the embeddings:

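import numpy
import tensorflow as tf

# A single Embedding layer: 6 possible actions, each mapped to a 2-dimensional vector.
model = tf.keras.Sequential()
embed = tf.keras.layers.Embedding(6, 2)
model.add(embed)
model.summary()
print(embed.get_weights())  # the full 6x2 embedding matrix E

# Look up the embeddings of the available actions 0, 2 and 4.
avail_actions = numpy.asarray([0, 2, 4])
model.compile("rmsprop", "mse")
out = model.predict(avail_actions)
for i in range(out.shape[0]):
    print("{}     {}".format(avail_actions[i], out[i]))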

Output:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 2)           12
=================================================================
Total params: 12
Trainable params: 12
Non-trainable params: 0
_________________________________________________________________
[array([[-0.04093417, -0.02362244],
       [-0.01528452, -0.02044444],
       [ 0.04733466,  0.01246139],
       [ 0.01975517,  0.02948004],
       [ 0.03812562,  0.0137356 ],
       [ 0.04121368, -0.01421856]], dtype=float32)]
2021-04-20 15:25:32.082275: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
0     [[-0.04093417 -0.02362244]]
2     [[0.04733466 0.01246139]]
4     [[0.03812562 0.0137356 ]]

I authored a paper that makes heavy use of invalid action masking in complex action spaces.

All the examples use Griddly and RLlib.
Paper: [2104.07294] Generalising Discrete Action Spaces with Conditional Action Trees
RLlib Code: GitHub - Bam4d/conditional-action-trees: Example Code for the Conditional Action Trees Paper

It might shed some light on action masking, why it’s required, and how you can apply it.
