[rllib] Dict Action Space and Custom Model

Hi there,

I’m sure it is a rather simple thing, but I did not find an example for it. The default Model can handle Dict action spaces. What I want to know is, how do I output dict action spaces in my custom model?

I tried it with directly outputting the dict, which results in

File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/models/torch/torch_action_dist.py", line 386, in __init__
    inputs = torch.from_numpy(inputs)
TypeError: expected np.ndarray (got dict)

Which is not surprising, as I in fact outputted a dict. Should I flatten it? Is there already a function in rllib doing this (such that my outputs align) or should one always write a custom action distribution?

Thank you for answering. (Especially if it is obvious to you)

Would you mind sharing the code of your custom model?

Sure. Maybe you can spot a mistake in my thought process.

I know that the expected number of outputs should be 64. I do not get how to transform my dict into these dimensions. I assume it should be flattened, since the flattened dimension of my dict would be 64, but I want to make sure this is correct.

import torch
from torch import nn
import numpy as np

from .._net_utils.task import BaseTask

import ray
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork

class FloodingTorchModel(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        print("-------building model")
        print("obs_space", obs_space.original_space)
        print("action_space", action_space)
        print("num_outputs", num_outputs)
        print("model_config", model_config)
        super().__init__(obs_space, action_space, num_outputs, model_config,name)

        # get properties/dimension of current task
        self.task = model_config.get("task", BaseTask())

        self.flat_space = obs_space
        self.obs_space= obs_space.original_space
        self.action_space = action_space

        self.value = None

        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=2)



        self.tohidden = nn.Linear(self.flat_space.shape[0],self.hidden_dim)
        self.hidden_layer = nn.Linear(self.hidden_dim,self.hidden_dim)

        self.hidden2node_state = nn.Linear(self.hidden_dim,self.task.node_emb_space.length)
        self.hidden2action = nn.Linear(self.hidden_dim,2)
        self.hidden2send = nn.Linear(self.hidden_dim,2)
        self.hidden2msg = nn.Linear(self.hidden_dim,4)
        self.hidden2value = nn.Linear(self.hidden_dim,1)

    def forward(self, input_dict, state, seq_lens):
        obs = input_dict["obs_flat"]
        # Store last batch size for value_function output.
        self._last_batch_size = obs.shape[0]
        # to use original observation space

        # observation = input_dict["obs"]

        # sensor = observation["sensor"]
        # node_state = observation["node_state"]
        # edge_emb = observation["edge_emb"]
        # msg = observation["msg"]
        # ids = observation["id"]
        # time = observation["time"]

        # print("sensor shape", sensor.shape)
        # print("node_state shape", node_state.shape)
        # print("msg shape", len(msg),msg[0].shape)

        hidden = self.tohidden(obs)
        hidden = self.relu(hidden)
        hidden = self.hidden_layer(hidden)

        node_state = self.hidden2node_state(hidden)
        action = self.hidden2action(hidden)

        send_msg = self.hidden2send(hidden)
        msg = self.hidden2msg(hidden)

        send_msg = torch.unsqueeze(send_msg,1).expand(-1,self.task.max_degree,-1)
        msg = torch.unsqueeze(msg,1).expand(-1,self.task.max_degree,-1)

        action = {
            'node_state': node_state,
            'action': action,
            'send_msg': send_msg,
            'msg': msg,
        }

        print("-----action shape")
        for key,val in action.items():
            print(f"key: {key} value_shape: {val.shape}")

        self.value = self.hidden2value(hidden)

        #the action should be processed/ changed
        #how do i know the order of the dict when flattening?
        return action, state

    def value_function(self):
        return self.value

Thank you very much

How do you know this? Is action_space defined on your environment?

I don’t know enough about RLlib to know if dict-based action spaces are supported. However, my intuition is that the output of forward() needs to be torch tensors, because they are used in the loss calculation and gradients have to flow back through them. So if you are using a dict, you probably need a custom loss function.

Yes, the action space is defined in my environment as follows (not that it really matters):

# logits: a: 2, c: 2, b: 10 * 4, d: 10 * 2 (each Tuple space has 10 entries), i.e. 64 in total

Dict(a:Discrete(2), b:Tuple(Discrete(4), Discrete(4), Discrete(4), Discrete(4), Discrete(4), Discrete(4), Discrete(4), Discrete(4), Discrete(4), Discrete(4)), c:Discrete(2), d:Tuple(Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2)))
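For anyone who wants to double-check the 64, here is a minimal sketch of the arithmetic. It does not use gym or RLlib: the space is described with plain Python values (an int n stands for Discrete(n), a tuple for Tuple, a dict for Dict), and the helper names num_logits and logit_slices are made up for this illustration. It assumes each Discrete(n) needs n logits and that keys are processed in alphabetical order.

```python
# Hypothetical plain-Python description of the action space from this thread:
# an int n stands for Discrete(n), a tuple for Tuple(...), a dict for Dict(...).
action_space = {
    "a": 2,
    "b": (4,) * 10,
    "c": 2,
    "d": (2,) * 10,
}


def num_logits(space):
    """Return the number of logits for a (possibly nested) space description."""
    if isinstance(space, int):
        # Discrete(n) is assumed to need n logits (one per category).
        return space
    if isinstance(space, tuple):
        return sum(num_logits(sub) for sub in space)
    if isinstance(space, dict):
        return sum(num_logits(space[key]) for key in sorted(space))
    raise TypeError(f"unsupported space description: {space!r}")


def logit_slices(space):
    """Map each top-level key, in alphabetical order, to its [start, end) columns."""
    slices = {}
    start = 0
    for key in sorted(space):
        end = start + num_logits(space[key])
        slices[key] = (start, end)
        start = end
    return slices


total = num_logits(action_space)
slices = logit_slices(action_space)
print(total)   # 64
print(slices)  # {'a': (0, 2), 'b': (2, 42), 'c': (42, 44), 'd': (44, 64)}
```

So under these assumptions, columns 0-1 of the flat output belong to a, 2-41 to b, 42-43 to c, and 44-63 to d.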

To fix it, I now use the following helper method:

class FloodingTorchModel(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        # ... (rest of __init__ as above)
        self.action_space = action_space
        # Alphabetical key order, matching how RLlib orders Dict keys.
        self.order = sorted(self.action_space.spaces.keys())

    def forward(self, input_dict, state, seq_lens):
        # ... (build the `action` dict as above)
        action = self.dict_action2preprocessed_action(action)
        # ...
        return action, state

    def dict_action2preprocessed_action(self, action):
        """
        The action has to be a flattened tensor.
        To make sure the flattened elements align, we use this converter.
        """
        stack = []
        for key in self.order:
            # Keep the batch dimension, flatten everything else.
            stack.append(torch.flatten(action[key], start_dim=1))

        cat = torch.cat(stack, 1)
        print("cat shape ", cat.shape)
        return cat

Note that RLlib currently orders dict keys alphabetically.

You are right, [BATCH, num_outputs] is expected from forward. It seems to run without a custom loss function.

Thank you for your help

When using dict action spaces, your model should output a flat tensor, which will then be passed into a MultiActionDistribution for action sampling. This sampling step then returns a dict.
The alphabetic sorting is potentially a problem; however, it’s forced upon RLlib by gym’s own Dict space handling (Dict.spaces is an OrderedDict).

If you check the code in MultiActionDistribution (ray/torch_action_dist.py at master · ray-project/ray · GitHub), you will see that we create an alphabetically sorted action_space_struct dict, which we then use to regenerate the action dict from your flat tensor outputs.

In other words, as long as your model returns a tensor whose components are sorted alphabetically according to your dict’s keys (print out self.action_space_struct in the MultiActionDistribution to see what the exact order should be in case you have additional nesting going on), it’ll be fine.
Alternatively, you can use a custom action distribution, which then would handle your model’s output (whatever that would be, e.g. a dict), but then you would be responsible for the “handover” between model and action distribution.
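
To make the "flat tensor in, dict out" round trip concrete, here is a rough sketch in plain PyTorch. It is not RLlib's actual MultiActionDistribution code; it only mimics the idea of concatenating per-key logits in alphabetical order and slicing them back apart. The sizes dict and the two helper functions are assumptions made up for this example, based on the action space posted above (a: 2, b: 10 * 4, c: 2, d: 10 * 2 logits).

```python
import torch

# Assumed per-key logit counts for the action space in this thread.
sizes = {"a": 2, "b": 40, "c": 2, "d": 20}


def flatten_in_key_order(outputs):
    """Concatenate per-key tensors [BATCH, ...] into one [BATCH, num_outputs] tensor."""
    parts = [torch.flatten(outputs[key], start_dim=1) for key in sorted(outputs)]
    return torch.cat(parts, dim=1)


def split_in_key_order(flat):
    """Slice a flat [BATCH, num_outputs] tensor back into per-key logits."""
    keys = sorted(sizes)
    chunks = torch.split(flat, [sizes[key] for key in keys], dim=1)
    return dict(zip(keys, chunks))


batch = 3
# Deliberately built in non-alphabetical order to show that sorting fixes the layout.
outputs = {
    "d": torch.randn(batch, 10, 2),
    "a": torch.randn(batch, 2),
    "b": torch.randn(batch, 10, 4),
    "c": torch.randn(batch, 2),
}

flat = flatten_in_key_order(outputs)
recovered = split_in_key_order(flat)
print(flat.shape)  # torch.Size([3, 64])
```

If the model concatenated its heads in a different order than the one used for slicing, the shapes would still add up to 64, but the logits would silently end up under the wrong keys, which is why it is worth printing the expected order once.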