Implementation of CommNet or DIAL in RLlib

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.


The aim is to have a policy network that returns both an action and a message. The message is used to communicate with the other agents.

@sven1977, is this doable with RLlib?

I am trying to implement communication in a MARL setting where the number of agents changes dynamically. The closest communication method for this setting that I have come across is CommNet.
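For context, CommNet's core idea is that each agent's communication vector is the average of the *other* agents' hidden states, which is exactly what makes it work with a varying number of agents. A minimal sketch of one communication round (the names `commnet_step` and `f` are illustrative, not RLlib or CommNet API):

```python
import torch

def commnet_step(h, f):
    """One CommNet-style communication round (illustrative sketch).

    h: (num_agents, hidden_dim) tensor of per-agent hidden states.
    f: module mapping concat([h_i, c_i]) -> next hidden state.
    The communication vector c_i is the mean of the other agents'
    hidden states, so the agent count can vary per episode.
    """
    n = h.shape[0]
    total = h.sum(dim=0, keepdim=True)   # sum over all agents
    c = (total - h) / max(n - 1, 1)      # mean over the *other* agents
    return f(torch.cat([h, c], dim=-1))
```

The same update works for 2 agents or 20, since `c` is computed from whatever rows are present in `h`.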

My policy network looks like this:

```python
import torch
import torch.nn as nn
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2


class PolicyNetwork(TorchModelV2, nn.Module):
    """Custom model that splits the observation into features and message."""

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)

        self.mlp_f = nn.Sequential(...)            # some layers
        self.mlp_ae = nn.Sequential(...)           # some layers
        self.mlp_msg = nn.Sequential(...)          # some layers
        self.mlp_interaction = nn.Sequential(...)  # some layers
        self.values = nn.Sequential(...)           # some layers
        self._last_value = None

    def forward(self, input_dict, state, seq_lens):
        features = input_dict["obs"][:, :128 + 5 + 3]
        message = input_dict["obs"][:, 128 + 5 + 3:]

        f_feature = self.mlp_f(features[:, -8:])  # h, similar to CommNet h
        ae_feature = self.mlp_ae(features)        # h, similar to CommNet h

        # c, similar to CommNet's communication vector c
        communication = self.mlp_msg(message)
        next_h = ae_feature + f_feature + communication
        final_out = self.mlp_interaction(next_h)

        self._last_value = self.values(features)

        return final_out, [next_h]

    def value_function(self):
        return torch.squeeze(self._last_value, -1)
```

The above policy network throws `KeyError: 'seq_lens'`.
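One likely cause (an assumption from the symptom, not from a traceback): returning a non-empty `state` list from `forward()` makes RLlib treat the model as recurrent, and it then expects `get_initial_state()` to be overridden so that time-major batches with `seq_lens` can be built. A minimal stand-in showing that contract (`MsgModel` and the hidden size 128 are illustrative):

```python
import torch

class MsgModel:
    """Minimal stand-in showing RLlib's recurrent-state contract."""
    hidden_dim = 128  # assumed size of the per-agent hidden/message vector

    def get_initial_state(self):
        # RLlib calls this to build the t=0 state when forward()
        # returns a non-empty state list; one tensor per state entry.
        return [torch.zeros(self.hidden_dim)]
```

Alternatively, if the message is routed through the observation (as in the flow chart) rather than through RLlib's recurrent state, `forward()` can simply return an empty state list (`return final_out, []`).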

I want the policy network to be similar to the CommNet architecture. The policy network should pass a message to the environment, where it is summed and passed back to the policy network at the next step t+1.

Below is the flow chart:

1. Reset env ==> get `obs_t = [state_t, msg_t]`, where time step t = 0 and `msg_0 = [0...0]`
2. `obs_t` ==> policy_net ==> `[output/action, msg_t+1]` ==> env ==> `new_obs_t+1`
3. `[state_t+1, msg_t+1]` ==> policy_net ==> `[output/action, msg_t+2]`

@CodingBurmer, have you implemented something like this?

Hi @Rohit_Modee ,

Sorry for the delay, we've been quite busy with the last release. I'm not an expert on multi-agent communication RL. Where is the error thrown? MAML does nothing specific with seq_lens.
What you are describing should be doable with RLlib.
If you can provide a reproduction script, we can debug.