How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi,
The aim is to have a policy network that returns both an action and a message. The message will be used to communicate with other agents.
@sven1977 is this doable using RLlib?
I am trying to implement communication in a MARL setting where the number of agents changes dynamically. The closest communication method for this setting that I have come across is CommNet.
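To make the CommNet idea concrete, here is a minimal plain-Python sketch of one communication step: each agent receives the mean of the other agents' hidden vectors, which handles a varying number of agents naturally. The function name and use of plain lists are illustrative assumptions, not part of CommNet or RLlib.

```python
def commnet_step(hidden):
    """One CommNet-style communication step (illustrative sketch).

    hidden: list of per-agent hidden vectors (lists of floats).
    Agent i receives c_i = mean of the other agents' hidden vectors,
    so the same computation works for any number of agents.
    """
    n = len(hidden)
    dim = len(hidden[0])
    comms = []
    for i in range(n):
        # Average the hidden vectors of every agent except agent i.
        others = [hidden[j] for j in range(n) if j != i]
        if others:
            c = [sum(v[k] for v in others) / len(others) for k in range(dim)]
        else:
            c = [0.0] * dim  # a lone agent receives a zero message
        comms.append(c)
    return comms
```

In the real architecture this mean would be computed over learned hidden states inside the network; the sketch only shows the aggregation pattern.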
My policy network looks like this:

```python
import torch
import torch.nn as nn

from ray.rllib.models.torch.torch_modelv2 import TorchModelV2


class PolicyNetwork(TorchModelV2, nn.Module):
    """Custom PyTorch model that emits an action and a message (CommNet-style)."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        self.mlp_f = nn.Sequential(...)            # some layers
        self.mlp_ae = nn.Sequential(...)           # some layers
        self.mlp_msg = nn.Sequential(...)          # some layers
        self.mlp_interaction = nn.Sequential(...)  # some layers
        self.values = nn.Sequential(...)           # some layers
        self._last_value = None

    def forward(self, input_dict, state, seq_lens):
        # The observation is [state features | incoming message].
        features = input_dict["obs"][:, :128 + 5 + 3]
        message = input_dict["obs"][:, 128 + 5 + 3:]
        f_feature = self.mlp_f(features[:, -8:])  # similar to CommNet h
        ae_feature = self.mlp_ae(features)        # similar to CommNet h
        communication = self.mlp_msg(message)     # similar to CommNet c
        next_h = ae_feature + f_feature + communication
        final_out = self.mlp_interaction(next_h)
        self._last_value = self.values(features)
        return final_out, [next_h]

    def value_function(self):
        return torch.squeeze(self._last_value, -1)
```
The above policy network throws `KeyError: 'seq_lens'`.
I want the policy network to follow the CommNet architecture: it should pass a message to the environment, where the messages are summed and passed back to the policy network at the next step t+1.
Below is the flow chart:
1. Reset env ==> get obs_t = ```[state_t, msg_t]```, where time step t=0 and msg_0 = [0...0]
2. obs_t ==> policy_net ==> ```[output/action, msg_t+1]``` ==> env ==> new obs_t+1
3. ```[state_t+1, msg_t+1]``` ==> policy_net ==> ```[output/action, msg_t+2]```
@CodingBurmer Have you implemented something like this?