Apply preprocessor in custom model

My observation is a Dict

{
    'observation': Dict({ .. dictionary with observations }),
    'mask': Box()  # Mask for action masking
}

and I have a custom model

1 class DQNModel(TFModelV2):
2
3    def __init__(self,
4                 obs_space: Space,
5                 act_space: Space,
6                 num_outputs: int,
7                 model_config: Dict,
8                 name: str):
9        super().__init__(obs_space, act_space, num_outputs, model_config, name)
10        orig_space = getattr(obs_space, "original_space", obs_space)
11
12        self.internal_model = FullyConnectedNetwork(
13            orig_space['observation'],
14            act_space,
15            num_outputs,
16            model_config,
17            name + '_internal',
18        )
19
20    def forward(self, input_dict, state, seq_lens):
21        action_mask = input_dict['obs']['mask']
22        logits, _ = self.internal_model({'obs': input_dict['obs']['observation']})
23
24        # Transform 0s in mask into -inf (using tf.float32.min to avoid NaNs)
25        inf_mask = tf.maximum(tf.math.log(action_mask), tf.float32.min)
26        masked_logits = logits + inf_mask
27
28        return masked_logits, state
29
30    def value_function(self):
31        return self.internal_model.value_function()

adapted from the example action_mask_model.py.

However, orig_space['observation'] (with orig_space taken from line 10) is of type gym.spaces.Dict, whose shape is None, so the FullyConnectedNetwork __init__ at line 12 will raise an error.

My understanding is that I need to take the space in orig_space['observation'], preprocess it so that it is flattened, and pass the result to the model in line 12. Is that correct? How can I do this?

Hi @fedetask ,

How does your 'observation': Dict({ .. dictionary with observations }) look?
How does your orig_space look?

Consider using a flattened observation space! RLlib has a utility method for most of the work here: ray.rllib.utils.spaces.space_utils.flatten_space. Or otherwise, you can provide a model that can handle a dictionary as input.

The FullyConnectedNetwork class does not handle automatic flattening of dict spaces, hence the error. It will need an observation space that has a shape!=None so that it can produce a first layer that fits the observations.
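To make the flattening idea concrete, here is a minimal pure-Python sketch of what flattening a nested Dict observation amounts to: Discrete values become one-hot vectors, Box values are unrolled, and the pieces are concatenated in a fixed key order. The spec tuples, the example keys, and the helpers `flat_size` and `flatten_obs` are made up for this illustration and are not RLlib or gym API; the real utilities operate on gym spaces.

```python
# Illustrative only: a toy version of "flatten a nested Dict observation".
# The ("discrete", n) / ("box", shape) tuples are a hypothetical stand-in
# for gym spaces, not a real API.

def flat_size(spec):
    """Number of entries the flattened vector will have."""
    if isinstance(spec, dict):
        return sum(flat_size(sub) for sub in spec.values())
    kind, arg = spec
    if kind == "discrete":      # Discrete(n) -> one-hot vector of length n
        return arg
    if kind == "box":           # Box(shape) -> product of its dimensions
        size = 1
        for dim in arg:
            size *= dim
        return size
    raise ValueError(f"unknown spec kind: {kind}")


def _unroll(value):
    """Flatten arbitrarily nested lists of numbers into one flat list."""
    if isinstance(value, (list, tuple)):
        out = []
        for item in value:
            out.extend(_unroll(item))
        return out
    return [float(value)]


def flatten_obs(spec, obs):
    """Flatten one observation according to its spec."""
    if isinstance(spec, dict):
        flat = []
        for key in sorted(spec):    # fixed key order keeps the layout stable
            flat.extend(flatten_obs(spec[key], obs[key]))
        return flat
    kind, arg = spec
    if kind == "discrete":
        one_hot = [0.0] * arg
        one_hot[obs] = 1.0
        return one_hot
    if kind == "box":
        return _unroll(obs)
    raise ValueError(f"unknown spec kind: {kind}")


# Hypothetical nested observation: one Discrete(10) and one Box of shape (10, 10).
spec = {"obs1": ("discrete", 10), "obs2": ("box", (10, 10))}
obs = {"obs1": 1, "obs2": [[0.0] * 10 for _ in range(10)]}
flat = flatten_obs(spec, obs)
```

The flattened vector has a well-defined length (`flat_size(spec)`), which is what a fully connected first layer needs in order to be built.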

Yes you’re right, my spaces are dictionaries so they aren’t flattened by the network, so I’ll resort to using the flattened observation space (which contains the action masks, but that’s not really a big deal).

However, this still does not work because DQN adds a final linear layer on top of the model forward(), so the masking of action_mask_model.py does not work. Do you know how to disable this behavior?

Your model config has a no_final_linear option:

config["model"]["no_final_linear"] = True

This should not have anything to do with DQN per se.

The FullyConnectedNetwork looks at your model config: if config["model"]["no_final_linear"] is True, it skips the final linear layer that would otherwise map the output of the last hidden layer to num_outputs.

Does this solve your problem?

Yes it does, thanks!

One last question: when using DQN, how does RLlib handle the custom model if it extends TFModelV2 versus extending DistributionalQTFModel?

I see that in parametric_actions_model.py they extend the latter but the implementation is the same (without considering the action embedding).

Independently of what model you extend, if you use DQN, RLlib will try to wrap your model with the DistributionalQTFModel interface. Meaning that it will create a new class that inherits from both your provided class and the DistributionalQTFModel class. It will do this only if your class does not inherit from DistributionalQTFModel anyway.
So I would simply inherit from DistributionalQTFModel to explicitly define the behaviour.

If you don’t want all of this to happen, you will have to call with_updates() on the DQN policy and change its make_model method to specify another behaviour. (In future versions of RLlib, you can simply inherit from the policy.)
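The wrapping described above can be pictured with plain Python classes. This is only a sketch of the mechanism (build a subclass that mixes in the interface unless the model already inherits from it); `QInterface`, `PlainModel`, `ExplicitModel`, and `wrap_if_needed` are hypothetical names, and the order of the base classes is an arbitrary choice for this sketch, not a statement about RLlib's actual code.

```python
class QInterface:
    """Stand-in for an interface class such as DistributionalQTFModel."""

    def get_q_values(self, model_out):
        return model_out


class PlainModel:
    """Stand-in for a custom model that does NOT inherit the interface."""

    def forward(self, obs):
        return [2.0 * x for x in obs]


class ExplicitModel(QInterface):
    """Stand-in for a custom model that inherits the interface explicitly."""

    def forward(self, obs):
        return list(obs)


def wrap_if_needed(model_cls, interface_cls):
    """Return model_cls itself, or a new class that also has the interface."""
    if issubclass(model_cls, interface_cls):
        return model_cls  # already has the interface: nothing to do
    # Dynamically create a subclass inheriting from both classes.
    # (Base order is an arbitrary choice made for this sketch.)
    return type(model_cls.__name__ + "_wrapped", (model_cls, interface_cls), {})


Wrapped = wrap_if_needed(PlainModel, QInterface)
model = Wrapped()
q_values = model.get_q_values(model.forward([1.0, 2.0]))
```

Inheriting explicitly, as `ExplicitModel` does here, means no wrapper class is generated, so the class hierarchy is exactly what is written in the source.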

Thanks again! Last question: I understood that I should implement action masking in the get_q_value_distributions() method of DistributionalQTFModel. However, the method only takes as input the model_out, while my action masks are in the observation.

  • Is it safe to store the mask when I receive it in the forward() and then use it in get_q_value_distributions()?
  • If num_atoms == 1, get_q_value_distributions() should return a tuple (action_scores, logits, dist), but only action_scores is actually used, right? And that’s where I should apply the action masking
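For reference, the pattern asked about in the first bullet looks roughly like the sketch below. It is a plain-Python illustration, not RLlib code: `MaskCachingModel` and its method names are hypothetical, and whether the cached mask is valid depends entirely on the assumption that forward() is always called on the same batch right before the Q-value method.

```python
FLOAT_MIN = -3.4e38  # roughly the most negative float32 value


class MaskCachingModel:
    """Hypothetical model that remembers the mask seen in forward()."""

    def __init__(self):
        self._last_mask = None

    def forward(self, obs):
        # Cache the mask so that a later call can use it.
        self._last_mask = obs["mask"]
        return list(obs["observation"])

    def get_q_values(self, model_out):
        # Assumption: forward() was called on the same batch just before.
        if self._last_mask is None:
            raise RuntimeError("forward() must be called first")
        return [
            q if m > 0 else FLOAT_MIN
            for q, m in zip(model_out, self._last_mask)
        ]


model = MaskCachingModel()
model_out = model.forward({"observation": [0.4, 2.0, 1.1], "mask": [1.0, 0.0, 1.0]})
q_values = model.get_q_values(model_out)
```

If the two calls can ever be made on different batches, the cached mask would no longer match model_out, which is the risk this pattern carries.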

@arturn hello, could you please clarify in simple words what the difference is between class ParametricActionsModel(DistributionalQTFModel) and class ActionMaskModel(TFModelV2) if I use, for example, PPOTrainer()? I only need action masking. The Ray docs recommend ParametricActionsModel(), but I also found ActionMaskModel() in the GitHub examples.

  • rllib/examples/models/parametric_actions_model.py
  • rllib/examples/models/action_mask_model.py

@arturn It seems I still need more help with applying a custom model… What options or code do I need to add to make this observation work with action masking in PPO?

self.observation_space = Dict({
    "mask": Box(0, 1, shape=(self.actions,)),
    "observation": Dict({
        "obs1": Discrete(10),
        "obs2": Box(low=-np.inf, high=np.inf, shape=(10, 10), dtype=np.float32),
    }),
})

I wrapped it in simple code:

import numpy as np
import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.registry import register_env
import gym
from gym.spaces import Box, Dict, Discrete
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.framework import try_import_torch
from ray.rllib.utils.torch_utils import FLOAT_MIN

torch, nn = try_import_torch()


# copy pasted from rllib/examples/models/action_mask_model.py
class TorchActionMaskModel(TorchModelV2, nn.Module):
    """PyTorch version of above ActionMaskingModel."""

    def __init__(
        self,
        obs_space,
        action_space,
        num_outputs,
        model_config,
        name,
        **kwargs,
    ):
        orig_space = getattr(obs_space, "original_space", obs_space)
        assert (
            isinstance(orig_space, Dict)
            and "action_mask" in orig_space.spaces
            and "actual_obs" in orig_space.spaces
        )

        TorchModelV2.__init__(
            self, obs_space, action_space, num_outputs, model_config, name, **kwargs
        )
        nn.Module.__init__(self)

        self.internal_model = TorchFC(
            orig_space["actual_obs"],
            action_space,
            num_outputs,
            model_config,
            name + "_internal",
        )


    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        action_mask = input_dict["obs"]["action_mask"]

        # Compute the unmasked logits.
        logits, _ = self.internal_model({"obs": input_dict["obs"]["actual_obs"]})

        # Convert action_mask into a [0.0 || -inf]-type mask.
        inf_mask = torch.clamp(torch.log(action_mask), min=FLOAT_MIN)

        # Return masked logits.
        return logits + inf_mask, state

    def value_function(self):
        return self.internal_model.value_function()




class MyEnv(gym.Env):

    metadata = {"render.modes": ["human"]}

    def __init__(self):
        super(MyEnv, self).__init__()

        self.actions = 4

        self.action_space = Discrete(self.actions)
        self.observation_space = Dict({
            "action_mask": Box(0, 1, shape=(self.actions,)),
            #"actual_obs": Box(low=-np.inf, high=np.inf, shape=(10, 10), dtype=np.float32),
            "actual_obs": Dict({
                "obs1": Discrete(10),
                "obs2": Box(low=-np.inf, high=np.inf, shape=(10, 10), dtype=np.float32),
            }),
        })
    
    def reset(self):
        return self._make_obs()
    
    def step(self, action):
        return self._make_obs(), 0, False, {}

    def _make_obs(self):
        return {
            "action_mask": np.array([1.0] * self.actions),
            #"actual_obs": np.zeros((10, 10), dtype=np.float32),
            "actual_obs": {"obs1": 1, "obs2": np.zeros((10, 10), dtype=np.float32)},
        }


def main():

    ray.init()
    select_env = "env-v1"
    register_env(select_env, lambda config: MyEnv())
    config = ppo.DEFAULT_CONFIG.copy()
    config.update({
        "env": select_env,
        "framework": 'torch',
        "log_level": 'DEBUG',
        "model": {
            "custom_model": TorchActionMaskModel,
            # "no_final_linear": False,
        },
    })

    agent = ppo.PPOTrainer(config, env=select_env)
    for _ in range(5):
        agent.train()

if __name__ == "__main__":
    main()

Error is:

prev_layer_size = int(np.product(obs_space.shape))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Hi @sirjay,

You can not use DistributionalQTFModel with the PPO trainer, because it was built for Q-Learning algorithms, which PPO is not.

Have a look at this.

The error you are referring to stems from obs_space.shape being None. This is because you use a Dict obs_space. You can flatten it first to gain a space that has a proper shape.

@fedetask model_out is the output of your underlying model. When computing q_values, you don’t need an action mask, only the model_out (which can be the observation itself if you don’t want any model layers).
The action mask would only be needed if you were computing actions, but you are computing action values!

@fedetask Action masking should be applied in the forward method, which is where you have access to your observations and therefore to your available actions!

But the forward() output is not the final output of the model, since DistributionalQTFModel can add other layers on top of the model_out produced by forward() (see build_action_value() in its __init__()). Or should I never set the model to use these additional layers?

The action mask would only be needed if you were computing actions, but you are computing action values!

I’m not sure I understood this, aren’t actions computed with an argmax operation over action values? Then if I want to apply action masking I need to set to -inf the action values of unavailable actions, so that the argmax operation will not choose them, right?

Hi again,

You have to implement forward() in your own model and extend that model with DistributionalQTFModel. If you follow the idea of the interface that is provided by RLlib, your own model will output masked actions already. The distributional model will use your forward method. The “x” that you find in the build_action_value() method should go to zero if you provide a model_out with -inf because of the softmax function. This is how I read the code at least; I have not written it myself. Have you tried masking in forward()?
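The softmax remark can be checked numerically. The sketch below does not reproduce build_action_value(); it only illustrates, with made-up logits, that an entry pushed to a very large negative value receives zero weight after a softmax.

```python
import math

FLOAT_MIN = -3.4e38  # roughly the most negative float32 value


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.2, 1.0, -0.5]          # made-up unmasked logits
inf_mask = [0.0, FLOAT_MIN, 0.0]   # the second action is unavailable
masked = [l + m for l, m in zip(logits, inf_mask)]
probs = softmax(masked)            # probs[1] underflows to exactly 0.0
```

The remaining probability mass is redistributed over the available actions, so `probs` still sums to one.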

I’m not sure I understood this, aren’t actions computed with an argmax operation over action values? Then if I want to apply action masking I need to set to -inf the action values of unavailable actions, so that the argmax operation will not choose them, right?

Generally, you are absolutely right about this. This discussion is purely about where masking can be applied: before (in question) or after (obvious) building the distributions.
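The argmax argument from the quoted question can be shown in a few lines of plain Python. The Q-values and mask below are made up, and `mask_values` and `greedy_action` are hypothetical helper names for this sketch.

```python
import math

FLOAT_MIN = -3.4e38  # roughly the most negative float32 value


def mask_values(values, action_mask):
    """Add ~-inf to unavailable entries (mask 0) and 0 to available ones (mask 1)."""
    masked = []
    for v, m in zip(values, action_mask):
        # log(1) = 0 leaves the value unchanged; log(0) is replaced by FLOAT_MIN.
        penalty = math.log(m) if m > 0 else FLOAT_MIN
        masked.append(v + max(penalty, FLOAT_MIN))
    return masked


def greedy_action(values):
    """Index of the largest value (argmax)."""
    return max(range(len(values)), key=values.__getitem__)


q_values = [1.5, 9.0, -0.3, 4.2]    # made-up Q-values
action_mask = [1.0, 0.0, 1.0, 1.0]  # action 1 is unavailable

unmasked_choice = greedy_action(q_values)
masked_choice = greedy_action(mask_values(q_values, action_mask))
```

Without the mask the greedy choice is the unavailable action 1; with the mask it falls back to the best available action.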

Yes, masking in forward() works, but I’m not using dueling DQN. Basically, DistributionalQTFModel does input -> forward() -> model_out. Now, model_out contains the masked Q values, as I want.

But if q_hiddens is specified, or use_noisy is True, other layers will be added on top of model_out, which I guess will break the model, since they will process the masked Q values and produce new values (also, I guess that layers taking tf.float32.min values as input will behave very badly).
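The concern can be illustrated with a toy fully connected layer in plain Python. The weights are arbitrary made-up numbers and `dense` is a hypothetical helper; the point is only that a linear layer mixes the huge masked entry into every output.

```python
FLOAT_MIN = -3.4e38  # roughly the most negative float32 value


def dense(inputs, weights, biases):
    """Fully connected layer: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [
        sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j]
        for j in range(len(biases))
    ]


masked_q = [1.5, FLOAT_MIN, -0.3]   # action 1 was masked in forward()
weights = [                         # arbitrary made-up weights
    [0.5, -0.2, 0.1],
    [0.1, 0.3, -0.4],
    [-0.3, 0.2, 0.6],
]
biases = [0.0, 0.0, 0.0]

out = dense(masked_q, weights, biases)
best = max(range(len(out)), key=out.__getitem__)
```

Every output is dominated by the masked input, so the original Q-values are lost and the greedy choice depends only on the sign of the weights attached to the masked entry.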

@arturn thank you for the answer!!

You can not use DistributionalQTFModel with the PPO trainer, because it was built for Q-Learning algorithms, which PPO is not.

  1. I made a typo in my question: what’s the difference between ParametricActionsModel(TFModelV2) and ActionMaskModel(TFModelV2) if I use PPOTrainer() and simply need action masking?

The error you are referring to stems from obs_space.shape being None. This is because you use a Dict obs_space. You can flatten it first to gain a space that has a proper shape.

  2. I flattened the space in the model, and now the error says:
File "/miniforge3/envs/rl/lib/python3.8/site-packages/keras/engine/input_spec.py", line 182, in assert_input_compatibility raise ValueError(f'Missing data for input "{name}". '
ValueError: Missing data for input "observations". You passed a data dictionary with keys ['obs1', 'obs2']. Expected the following keys: ['observations']

How should I proceed? Something seems to be missing; I searched all over GitHub and did not find any examples of how to handle this.

from gym.spaces import utils
class ActionMaskModel(TFModelV2):
    ...
        assert (
            isinstance(orig_space, Dict)
            and "action_mask" in orig_space.spaces
            and "actual_obs" in orig_space.spaces
        )

        self.internal_model = FullyConnectedNetwork(
            utils.flatten_space(orig_space["actual_obs"]),
            action_space,
            num_outputs,
            model_config,
            name + "_internal",
        )

class MyEnv(gym.Env):
    def __init__(self):
        super(MyEnv, self).__init__()
        self.observation_space_dict = Dict({
            "action_mask": Box(0, 1, shape=(self.actions,)),
            "actual_obs": Dict({
                "obs1": Discrete(5),
                "obs2": Box(low=-np.inf, high=np.inf, shape=(10, 10), dtype=np.float32),
            }),
        })
  3. By the way, instead of flattening, can I simply move "obs1" and "obs2" outside "actual_obs"?
self.observation_space = Dict({
    "action_mask": Box(0, 1, shape=(3,)),
    "actual_obs": Box(low=-np.inf, high=np.inf, shape=(10, 10)),
    "obs1": Discrete(10),
    "obs2": Discrete(10),
    # ...
})

instead of:

self.observation_space = Dict({
    "action_mask": Box(0, 1, shape=(3,)),
    "actual_obs": Dict({
        "obs_box": Box(low=-np.inf, high=np.inf, shape=(10, 10)),
        "obs1": Discrete(10),
        "obs2": Discrete(10),
        # ...
    })
})

Does it make sense? Will my network work properly? I checked it and there are no errors during execution.
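One thing that may be worth checking with the first layout: the model shown earlier in this thread builds its internal network from orig_space["actual_obs"] and feeds it only input_dict["obs"]["actual_obs"]. If that model is used unchanged, keys placed next to "actual_obs" would not reach the network, which could explain the absence of errors. The sketch below illustrates this with plain lists standing in for tensors; `network_input` and `one_hot` are hypothetical helpers, and it assumes the Discrete entries arrive one-hot encoded.

```python
# Illustrative only: plain lists stand in for tensors, and
# `network_input` is a hypothetical helper, not an RLlib function.

def _unroll(value):
    """Flatten arbitrarily nested lists of numbers into one flat list."""
    if isinstance(value, (list, tuple)):
        out = []
        for item in value:
            out.extend(_unroll(item))
        return out
    return [float(value)]


def network_input(obs, keys):
    """Concatenate the flattened entries of `obs` listed in `keys`."""
    flat = []
    for key in keys:
        flat.extend(_unroll(obs[key]))
    return flat


def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


# Layout with "obs1" and "obs2" as siblings of "actual_obs".
obs = {
    "action_mask": [1.0, 1.0, 1.0],
    "actual_obs": [[0.0] * 10 for _ in range(10)],
    "obs1": one_hot(3, 10),   # assumed one-hot encoding of Discrete(10)
    "obs2": one_hot(7, 10),
}

# What a model that only reads "actual_obs" would see:
only_actual = network_input(obs, ["actual_obs"])
# What the network would need to see to use all observations:
everything = network_input(obs, ["actual_obs", "obs1", "obs2"])
```

With only "actual_obs" forwarded, the network input has 100 entries and carries no information from "obs1" or "obs2"; including them requires concatenating all three pieces, for 120 entries here.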