How severe does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am trying to implement communication in a MARL setting where the number of agents changes dynamically. The closest communication method I have come across for this setting is CommNet. The aim is to have a policy network that returns both an action and a message; the message will be used to communicate with the other agents.
@sven1977 is this doable using RLlib?
My policy network looks like this.
```python
class PolicyNetwork(TorchModelV2, nn.Module):
    """Example of a PyTorch custom model that just delegates to a fc-net."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        self.mlp_f = nn.Sequential(...)            # some layers
        self.mlp_ae = nn.Sequential(...)           # some layers
        self.mlp_msg = nn.Sequential(...)          # some layers
        self.mlp_interaction = nn.Sequential(...)  # some layers
        self.values = nn.Sequential(...)           # some layers
        self._last_value = None

    def forward(self, input_dict, state, seq_lens):
        features = input_dict["obs"][:, :128 + 5 + 3]
        message = input_dict["obs"][:, 128 + 5 + 3:]
        f_feature = self.mlp_f(features[:, -8:])   # similar to CommNet h
        ae_feature = self.mlp_ae(features)         # similar to CommNet h
        communication = self.mlp_msg(message)      # similar to CommNet c
        next_h = ae_feature + f_feature + communication
        final_out = self.mlp_interaction(next_h)
        self._last_value = self.values(features)
        return final_out, [next_h]

    def value_function(self):
        return torch.squeeze(self._last_value, -1)
```
The above policy network throws an error.
I want the policy network to be similar to the CommNet architecture: at each step it should pass a message to the environment, where the messages are summed and passed back to the policy network at the next step t+1.
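For reference, here is a minimal sketch of the CommNet-style update I have in mind. This is my own toy code, not RLlib API; the mean-over-other-agents aggregation, `CommStep`, and the hidden size are assumptions on my part:

```python
import torch
import torch.nn as nn


class CommStep(nn.Module):
    """One CommNet-style communication step (sketch):
    c_i = mean of the *other* agents' hidden states,
    h_i' = tanh(W_h h_i + W_c c_i)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.w_h = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.w_c = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [num_agents, hidden_dim]; works for any dynamic num_agents >= 1
        n = h.shape[0]
        total = h.sum(dim=0, keepdim=True)   # [1, hidden_dim]
        c = (total - h) / max(n - 1, 1)      # mean over the other agents
        return torch.tanh(self.w_h(h) + self.w_c(c))


h = torch.randn(5, 16)           # 5 agents, hidden size 16
h_next = CommStep(16)(h)
print(h_next.shape)              # torch.Size([5, 16])
```

Because the aggregation is a sum/mean, the same module handles any number of agents, which is what I need for the dynamic setting.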
Below is the flow chart:
1. Reset env ==> get `obs_t = [state_t, msg_t]`, where time step `t = 0` and `msg_0 = [0...0]`
2. `obs_t` ==> policy_net ==> `[output/action, msg_t+1]` ==> env ==> new `obs_t+1`
3. `[state_t+1, msg_t+1]` ==> policy_net ==> `[output/action, msg_t+2]`
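The environment side of this loop could be sketched as below. This is a toy illustration, not RLlib API; `MsgPassingEnv`, `STATE_DIM`, and `MSG_DIM` are hypothetical names I made up:

```python
import numpy as np

STATE_DIM, MSG_DIM = 4, 3  # assumed sizes for illustration


class MsgPassingEnv:
    """Toy sketch of the loop above: the env stores the messages emitted
    at step t and concatenates their sum into every agent's observation
    at step t+1."""

    def __init__(self, num_agents: int):
        self.num_agents = num_agents
        self.msg_sum = np.zeros(MSG_DIM)

    def reset(self):
        self.msg_sum = np.zeros(MSG_DIM)       # msg_0 = [0...0]
        return self._obs()

    def step(self, actions, messages):
        # messages: [num_agents, MSG_DIM], emitted by the policy at step t
        self.msg_sum = np.asarray(messages).sum(axis=0)
        return self._obs()                     # obs_{t+1} = [state, msg]

    def _obs(self):
        state = np.random.randn(STATE_DIM)     # placeholder state
        return {i: np.concatenate([state, self.msg_sum])
                for i in range(self.num_agents)}
```

The policy then only ever sees a flat `[state, summed_msg]` vector, matching the slicing in my `forward()` above.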
@CodingBurmer Have you implemented something like this?