[rllib] impossible actions

Hi there, I’d like to set up impossible actions in a multi-agent env. I’ve read the docs (RLlib Models, Preprocessors, and Action Distributions — Ray v2.0.0.dev0) but I don’t understand the purposes of avail_actions and action_mask. Could someone explain them to me, please?

I’m in a similar situation. Disclaimer: I know very little about RL, this is just what I’ve pieced together over a few hours googling.

avail_actions seems to be there for action embeddings. If you follow the links in the docs far enough, you’ll get to the ParametricActionsCartPole example. action_mask is what we really want; unfortunately, that example interweaves it with action embedding.

I would imagine you could delete self.action_assignments and its friends to get down to bare, mask-only functionality. You’d also need to modify ParametricActionsModel, since it expects avail_actions in the observations and uses it to compute intent_vector, and thus action_logits.
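For what it’s worth, here is roughly what I imagine a stripped-down, mask-only model would look like. This is just a sketch written as a plain torch.nn.Module (not RLlib’s TorchModelV2, to keep it self-contained), and names like obs_size and the hidden-layer sizes are my own choices:

```python
import torch
import torch.nn as nn

FLOAT_MIN = torch.finfo(torch.float32).min

class MaskOnlyModel(nn.Module):
    """Sketch of a policy net that masks invalid actions.

    The observation dict is assumed to hold the raw observation under
    "obs" and a 0/1 "action_mask" -- no avail_actions, no embeddings.
    """

    def __init__(self, obs_size: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs_dict):
        logits = self.body(obs_dict["obs"])
        # Turn the 0/1 mask into additive penalties: log(1) = 0 for
        # valid actions, log(0) = -inf clamped to a huge negative value.
        inf_mask = torch.clamp(torch.log(obs_dict["action_mask"]), min=FLOAT_MIN)
        return logits + inf_mask

model = MaskOnlyModel(obs_size=5, num_actions=3)
out = model({
    "obs": torch.randn(2, 5),
    "action_mask": torch.tensor([[1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]),
})
```

In a real RLlib custom model you’d do the same thing inside forward() of a TorchModelV2 subclass and return (masked_logits, state).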

The theory here seems simple: to mask, just intercept forward calls and set the logits for masked/invalid actions to a very negative value. I’m not sure why I can’t crack it; probably a silly dimensions issue.
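To illustrate the trick in isolation (a toy example with made-up logits, not RLlib code): taking the log of a 0/1 mask gives 0 for valid actions and -inf for invalid ones, which you clamp to a large finite negative number and add to the logits. After softmax, invalid actions get (numerically) zero probability:

```python
import torch
import torch.nn.functional as F

# Toy example: 4 actions; actions 1 and 3 are invalid.
logits = torch.tensor([[1.0, 2.0, 0.5, 3.0]])
action_mask = torch.tensor([[1.0, 0.0, 1.0, 0.0]])

# log(1) = 0 leaves valid actions untouched; log(0) = -inf is clamped
# to the most negative finite float32 value.
FLOAT_MIN = torch.finfo(torch.float32).min
inf_mask = torch.clamp(torch.log(action_mask), min=FLOAT_MIN)
masked_logits = logits + inf_mask

# Invalid actions end up with zero probability under the softmax.
probs = F.softmax(masked_logits, dim=-1)
```

The only dimension requirement is that action_mask broadcasts against the logits, i.e. both are (batch, num_actions).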

There’s a good blog post on this, but it only has one line on avail_actions:

The available actions correspond to each of the five items the agent can select for packing.

The author seems to work around avail_actions rather than excising it; they always set it to ones. Maybe that’s the easier approach.

If any maintainers read this, I’d love to see an example with action masking and embedding separated. I’m sure it’s painfully obvious to experts how to separate them.

Thank you very much for your answer. I’ve read through the article, but unfortunately it doesn’t explain the tricky parts at all. I still have no idea what action embedding is. I managed to mask out impossible actions by using action_mask like this:

    # FLOAT_MIN / FLOAT_MAX come from ray.rllib.utils.torch_utils.
    # log(1) = 0 leaves valid actions untouched; log(0) = -inf is
    # clamped to FLOAT_MIN, giving invalid actions a huge negative logit.
    inf_mask = torch.clamp(torch.log(action_mask), FLOAT_MIN, FLOAT_MAX)
    return output + inf_mask, []

(this is in an actor-critic network; output holds the logits behind the policy).
But I wonder whether I’m missing something important that’s needed to make everything work with avail_actions and action embeddings.
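As far as I understand it (take this with a grain of salt), the embedding idea in ParametricActionsModel is: the env supplies one embedding vector per available action (avail_actions), the network produces an "intent vector" in the same space, and each action’s logit is the dot product of the two. A tiny sketch with made-up shapes and random tensors:

```python
import torch

# One embedding vector per currently-available action, supplied by the
# env in the observation (this is what avail_actions holds). Shapes
# here are made up for the demo.
batch, num_actions, embed_dim = 1, 4, 8
avail_actions = torch.randn(batch, num_actions, embed_dim)

# The model's trunk maps the observation to an "intent vector" living
# in the same embedding space.
intent_vector = torch.randn(batch, embed_dim)

# Logit for each action = dot product of its embedding with the intent
# vector: actions "similar" to the intent score higher.
action_logits = torch.sum(avail_actions * intent_vector.unsqueeze(1), dim=2)
```

The appeal seems to be that the action set can vary in size and content per step, since logits are computed from whatever embeddings the env hands over, rather than from a fixed output layer.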

Yeah, I sympathize. I still don’t quite grok, but I did find this post a bit enlightening: https://neuro.cs.ut.ee/the-use-of-embeddings-in-openai-five/


I love that article! I’ve long wanted to see an application of attention mechanisms to reinforcement learning, and also an application of reinforcement learning to a complex, variable observation space as well as a complex, variable action space.