Action masking error

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi, I am trying to implement simple discrete action masking in RLlib. The idea is to have the entire action space (182 discrete actions) available at the beginning of an episode; once an action has been chosen it cannot be picked again, so the set of valid actions keeps shrinking. I have read through the related examples on GitHub, including action_masking.py and action_mask_env.py, as well as some previous posts on the forum, so I do have a vague idea. However, I keep getting the same error message when initializing the algo.

Also, is it mandatory to set "framework": "tf2" in the config when using the custom model? You can find snippets of the code below. Thank you very much!

import gym
import numpy as np
from gym.spaces import Box, Dict, Discrete


class ActionMaskEnv(gym.Env):

    def __init__(self, env_config):
        super(ActionMaskEnv, self).__init__()

        # Define observation space: the actual observations plus the action mask.
        self.observation_space = Dict({
            "observations": Box(low=0, high=1, shape=(910,), dtype=np.int32),
            "action_mask": Box(0, 1, shape=(182,)),
        })

        # Define action space.
        self.action_space = Discrete(182)

        # Action mask (initially all 182 actions are possible).
        # Updated at every step until the entire array is zeros.
        self.mask = np.ones(182, dtype=np.float32)

.......
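The elided part is essentially a step() that zeroes out the chosen action in the mask. A simplified sketch of that logic (the reward computation and the observation contents are placeholders here, not my real code):

    def reset(self):
        # Make all 182 actions available again.
        self.mask = np.ones(182, dtype=np.float32)
        obs = np.zeros(910, dtype=np.int32)  # placeholder observation
        return {"observations": obs, "action_mask": self.mask}

    def step(self, action):
        # Once an action has been chosen it cannot be picked again.
        self.mask[action] = 0.0

        obs = np.zeros(910, dtype=np.int32)  # placeholder observation
        reward = 0.0                         # placeholder reward
        # Episode ends when no valid actions are left.
        done = not np.any(self.mask)
        return {"observations": obs, "action_mask": self.mask}, reward, done, {}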
import tensorflow as tf
from gym.spaces import Dict

from ray.rllib.algorithms import dqn
from ray.rllib.models import ModelCatalog
from ray.rllib.models.tf.fcnet import FullyConnectedNetwork
from ray.rllib.models.tf.tf_modelv2 import TFModelV2


class ActionMaskModel(TFModelV2):
    """Model that handles simple discrete action masking.

    This assumes the outputs are logits for a single Categorical action dist.
    Getting this to work with a more complex output (e.g., if the action space
    is a tuple of several distributions) is also possible but left as an
    exercise to the reader.
    """

    def __init__(
            self, obs_space, action_space, num_outputs, model_config, name, **kwargs
    ):
        orig_space = getattr(obs_space, "original_space", obs_space)
        assert (
            isinstance(orig_space, Dict)
            and "action_mask" in orig_space.spaces
            and "observations" in orig_space.spaces
        )

        super().__init__(obs_space, action_space, num_outputs, model_config, name)

        self.internal_model = FullyConnectedNetwork(
            orig_space["observations"],
            action_space,
            num_outputs,
            model_config,
            name + "_internal",
        )
        
        print(orig_space["observations"])

        # Disable action masking --> will likely lead to invalid actions.
        self.no_masking = model_config["custom_model_config"].get("no_masking", False)

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        print(input_dict)
        action_mask = input_dict["obs"]["action_mask"]
        print(input_dict["obs"])
        print(action_mask)

        # Compute the unmasked logits.
        logits, _ = self.internal_model({"obs": input_dict["obs"]["observations"]})
        

        # If action masking is disabled, directly return unmasked logits
        if self.no_masking:
            return logits, state

        # Convert action_mask into a [0.0 || -inf]-type mask.
        inf_mask = tf.maximum(tf.math.log(action_mask), tf.float32.min)
        masked_logits = logits + inf_mask

        # Return masked logits.
        return masked_logits, state

    def value_function(self):
        return self.internal_model.value_function()


ModelCatalog.register_custom_model("action_mask_model", ActionMaskModel)

algo = dqn.DQN(env=ActionMaskEnv, config={
    "rollout_fragment_length": 100,
    "env_config": {},
    "hiddens": [],
    "model": {
        "custom_model": "action_mask_model",
    },
    "train_batch_size": 1000,
    "framework": "tf2",
    "horizon": 182,
    "eager_tracing": True,
    "min_train_timesteps_per_iteration": 100,
    "min_sample_timesteps_per_iteration": 2000,
})

I think your policy at some point is producing an action value of 182, which is not a valid action value for Discrete(182). The last valid action value is 181. That’s what the error is hinting at.
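For illustration, a quick self-contained check with plain gym (independent of RLlib):

from gym.spaces import Discrete

space = Discrete(182)
print(space.contains(181))  # True: valid actions are 0 .. 181
print(space.contains(182))  # False: 182 is outside the space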

Without knowing the details of the env it’s hard to guess what is happening. Could you maybe share a small version of your env as well as an e2e script that repros your problem?

Thanks

Hi @kourosh,

I have no idea what is going on here, but I was discussing an issue with @Lars_Simon_Zehnder recently and we have both independently debugged cases where the policy has exploding loss values, which results in NaN logits from the policy, which in turn ends up as illegal action values in Discrete spaces (one larger than the size of the space).

Just a heads up because I think I have seen a couple of other posts in the forums that sound similar.
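A tiny standalone sketch of that failure mode (assuming sampling goes through tf.random.categorical, as a TF Categorical action distribution typically does; the exact behavior may vary by TF version):

import numpy as np
import tensorflow as tf

# Logits for a 4-action Discrete space that have gone NaN
# (e.g. after an exploding loss corrupted the weights).
nan_logits = tf.constant([[np.nan, np.nan, np.nan, np.nan]])

# The sampled index often comes back as 4 here, i.e. one larger than the
# last valid action (3) -- an out-of-range action for Discrete(4).
print(tf.random.categorical(nan_logits, num_samples=1))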


Interesting! @mannyv, is there a GitHub issue tracking this, or any repro script that we can look into?

Yup, thanks @mannyv for making this visible to me. That is the error I was debugging for some time. It is also not a very informative one (from TensorFlow, not from RLlib). In my case the label values were not the 182 as here - usually the network ended up with NaN weights after updates, and a very small learning rate can help here.

This error happened a lot to me - and there is always a danger of it coming up in later iterations. It is not shown in the question, but @rl_rookie, do you see a very high loss and/or very high grads in TensorBoard?

I was talking with @sven1977 in this issue about simplifying the traceback of such errors (here is a PR that checks numerics in BaseEnvs, and another one is coming today that checks numerics on the fly in TF), as the sources of NaNs/Infs can be manifold (inputs, weights, losses, etc.).
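Roughly the idea of the BaseEnv check, as a simplified sketch (not the actual PR code, and only for flat array observations):

import numpy as np

def check_obs_numerics(obs):
    # Fail fast if an environment returns NaN/Inf observations,
    # before they ever reach the model and corrupt the weights.
    arr = np.asarray(obs, dtype=np.float64)
    if not np.all(np.isfinite(arr)):
        raise ValueError("Non-finite values in observation: {}".format(arr))
    return obs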

Thank you @kourosh and @mannyv. I was able to get it working after switching from DQN to PPO and didn’t investigate further after that. But I do believe it’s related to the bug @mannyv was mentioning, as the only thing I changed was the algo.


Hey everyone, this is indeed a strange issue and yes, it has been following us in one way or another for quite some time without a real resolution, other than that the env (and learning to solve the problem the env poses) is unstable. What I would like to understand is the following:
Even if the model parameters all go NaN (due to e.g. an exploding loss), the logit outputs would also be NaN, and the construction of the Discrete distribution object would then fail (it would not result in sampling an action that is one larger than the boundary).
Also, we imho have proper CI tests for these extreme distribution inputs in this file here (ray/test_action_distributions.py at master · ray-project/ray · GitHub), and I was unable to reproduce a distribution-sampling behavior that matches what you see, even when plugging different extreme values (like NaN, -inf, tf.float.min, etc.) into the distribution’s input.

Hi @sven1977, thanks for jumping onto this. Yes, this error does not make any sense (I debugged this for more than two weeks and could not find the reason for it). Imo this is also a non-informative and even misleading error message from TensorFlow’s sparse_softmax_cross_entropy_with_logits() function - you start searching for these labels, but that is not where the error lies.
Are you sure that this test is really checking for this? I set a breakpoint at this line and, at the first hit, fed the following NumPy array into the distribution:

# NaN cast to int32 ends up as -2147483648 (INT32_MIN) here, which is
# exactly the label value shown in the error below.
extreme_nan = np.full((10000,), np.nan, dtype=np.int32)
dist.logp(extreme_nan)

This raises the error I have seen in my experiments and debugged for a long time, and it is the same one as in this issue:

tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node
 __wrapped__SparseSoftmaxCrossEntropyWithLogits_device_/job:localhost/replica:0
/task:0/device:CPU:0}} Received a label value of -2147483648 which is outside the
valid range of [0, 4).  Label values: -2147483648 -2147483648 -2147483648
-2147483648 -2147483648 -2147483648 ...

I don’t see how we can safeguard the distribution - it is a NaN/Inf error and these simply can happen. The values come from somewhere in the experiment and are highly dependent on the combination of environment, model, dist, hyperparameters, etc., which can be quite sensitive. They can only be traced back, and for that some easy-to-handle tools are needed imo.
In my PRs (31431 and 31569) I give some ways to check the env, or the tensors in the algorithm at runtime, for NaNs/Infs and print them to see what is happening.

In the second PR mentioned I use tf.debugging.check_numerics(). Using it as a safeguard for the distribution input works, but it puts this op permanently into the graph (we might not want this in such frequently executed functions - it might hurt performance). The solution I propose only adds the op when the user wants to debug tensors - usually only when there is an error like the one here.
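Roughly, the idea looks like this (a simplified sketch, not the exact code from the PR):

import tensorflow as tf

def maybe_check_numerics(logits, debug_numerics=False):
    # Only insert the check op when explicitly debugging, so it does not
    # end up permanently in the (frequently executed) sampling path.
    if debug_numerics:
        logits = tf.debugging.check_numerics(
            logits, message="NaN/Inf found in distribution inputs (logits)"
        )
    return logits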

Quick question:
why is your horizon set to 182? Is it because it equals the number of actions here?

Hi @Archana_R,

The horizon determines the maximum number of steps an episode can have. For many environments there is no horizon and the environment returns done=True when some terminating condition occurs.

For other environments that may also be true, but you additionally want to cap the length, so that if the episode has not terminated after x steps it is artificially terminated.

In the example above there is a third case. Here the environment is exactly 182 steps long and is terminated via the horizon. It could have returned done=True after 182 steps, but they chose to do it this way instead.

Now in this environment there are 182 decisions to be made (actions to take) and each action can only be taken once. That is why the action space size and the horizon match. It is essentially an accident (feature) of this environment; in most cases the size of the action space and the horizon will not match up.
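To illustrate the difference (a minimal, hypothetical env, not the one from this thread):

import gym
import numpy as np
from gym.spaces import Box, Discrete


class FixedLengthEnv(gym.Env):
    """Hypothetical env that terminates itself after 182 steps,
    instead of relying on "horizon": 182 in the algorithm config."""

    def __init__(self, env_config=None):
        self.observation_space = Box(0, 1, shape=(1,), dtype=np.float32)
        self.action_space = Discrete(182)
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        self.t += 1
        # Returning done=True here has the same effect as capping the
        # episode length via "horizon": 182 in the config.
        done = self.t >= 182
        return np.zeros(1, dtype=np.float32), 0.0, done, {}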
