RLlib Parameter Sharing / MARL Communication

Hey,

can someone explain to me what parameter sharing in RLlib really means? Are all agents really using the very same network? Or does every agent have an identical network, with the parameters shared between these networks?

I’m asking because I want to implement a MARL algorithm like BiCNet, where all agents use the same network and communicate with each other over a bidirectional RNN (B-RNN) layer.

My torch model (BiCNet):

import torch
import torch.nn as nn

# Note: HIDDEN_DIM and the fanin_init helper are defined elsewhere in my code.


class Actor(nn.Module):
    def __init__(self, s_dim, a_dim, n_agents):
        super(Actor, self).__init__()

        self.s_dim = s_dim
        self.a_dim = a_dim
        self.n_agents = n_agents

        # input (batch * num_agents, s_dim), output (batch * num_agents, HIDDEN_DIM // 2)
        self.prev_dense = DenseNet(s_dim, HIDDEN_DIM, HIDDEN_DIM // 2, output_activation=None, norm_in=True)
        # input (batch, num_agents, HIDDEN_DIM // 2), output (batch, num_agents, HIDDEN_DIM) (bidirectional)
        self.comm_net = LSTMNet(HIDDEN_DIM // 2, HIDDEN_DIM // 2, num_layers=1)
        # input (batch * num_agents, HIDDEN_DIM + s_dim), output (batch * num_agents, a_dim)
        self.post_dense = DenseNet(HIDDEN_DIM + s_dim, HIDDEN_DIM // 2, a_dim, output_activation=nn.Tanh)

    def forward(self, x):
        # x: (batch, num_agents, s_dim)
        x_s = x
        x = x.view(-1, self.s_dim)
        x = self.prev_dense(x)
        x = x.reshape(-1, self.n_agents, HIDDEN_DIM // 2)
        x = self.comm_net(x)
        x = torch.cat((x, x_s), dim=-1)
        x = x.reshape(-1, HIDDEN_DIM + self.s_dim)
        x = self.post_dense(x)
        x = x.view(-1, self.n_agents, self.a_dim)
        return x


class Critic(nn.Module):
    def __init__(self, s_dim, a_dim, n_agents):
        super(Critic, self).__init__()

        self.s_dim = s_dim
        self.a_dim = a_dim
        self.n_agents = n_agents

        # input (batch * num_agents, s_dim + a_dim), output (batch * num_agents, HIDDEN_DIM // 2)
        self.prev_dense = DenseNet((s_dim + a_dim), HIDDEN_DIM, HIDDEN_DIM // 2, output_activation=None, norm_in=True)
        # input (batch, num_agents, HIDDEN_DIM // 2), output (batch, num_agents, HIDDEN_DIM) (bidirectional)
        self.comm_net = LSTMNet(HIDDEN_DIM // 2, HIDDEN_DIM // 2, num_layers=1)
        # input (batch * num_agents, HIDDEN_DIM + s_dim), output (batch * num_agents, 1)
        self.post_dense = DenseNet(HIDDEN_DIM + s_dim, HIDDEN_DIM // 2, 1, output_activation=None)

    def forward(self, x_n, a_n):
        # x_n: (batch, num_agents, s_dim), a_n: (batch, num_agents, a_dim)
        x = torch.cat((x_n, a_n), dim=-1)
        x = x.view(-1, (self.s_dim + self.a_dim))
        x = self.prev_dense(x)

        x = x.reshape(-1, self.n_agents, HIDDEN_DIM // 2)
        x = self.comm_net(x)
        x = torch.cat((x, x_n), dim=-1)
        x = x.reshape(-1, HIDDEN_DIM + self.s_dim)

        x = self.post_dense(x)
        x = x.view(-1, self.n_agents, 1)
        return x


class DenseNet(nn.Module):
    def __init__(self, s_dim, hidden_dim, a_dim, norm_in=False, hidden_activation=nn.ReLU, output_activation=None):
        super(DenseNet, self).__init__()

        self._norm_in = norm_in

        if self._norm_in:
            self.norm1 = nn.BatchNorm1d(s_dim)
            self.norm2 = nn.BatchNorm1d(hidden_dim)
            self.norm3 = nn.BatchNorm1d(hidden_dim)
            self.norm4 = nn.BatchNorm1d(hidden_dim)

        self.dense1 = nn.Linear(s_dim, hidden_dim)
        self.dense1.weight.data = fanin_init(self.dense1.weight.data.size())
        self.dense2 = nn.Linear(hidden_dim, hidden_dim)
        self.dense2.weight.data = fanin_init(self.dense2.weight.data.size())
        self.dense3 = nn.Linear(hidden_dim, hidden_dim)
        self.dense3.weight.data.uniform_(-0.003, 0.003)
        self.dense4 = nn.Linear(hidden_dim, a_dim)

        if hidden_activation:
            self.hidden_activation = hidden_activation()
        else:
            self.hidden_activation = lambda x: x

        if output_activation:
            self.output_activation = output_activation()
        else:
            self.output_activation = lambda x: x

    def forward(self, x):
        use_norm = self._norm_in and x.shape[0] != 1

        if use_norm: x = self.norm1(x)
        x = self.hidden_activation(self.dense1(x))
        if use_norm: x = self.norm2(x)
        x = self.hidden_activation(self.dense2(x))
        if use_norm: x = self.norm3(x)
        x = self.hidden_activation(self.dense3(x))
        if use_norm: x = self.norm4(x)
        x = self.output_activation(self.dense4(x))
        return x


class LSTMNet(nn.Module):
    def __init__(self, input_size, hidden_size,
                 num_layers=1,
                 bias=True,
                 batch_first=True,
                 bidirectional=True):
        super(LSTMNet, self).__init__()

        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            bias=bias,
            batch_first=batch_first,
            bidirectional=bidirectional
        )

    def forward(self, input, wh=None, wc=None):
        # wh and wc are currently unused; the LSTM starts from a zero state.
        output, (hidden, cell) = self.lstm(input)
        return output
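
For reference, a quick shape check of the Actor (just a sketch; HIDDEN_DIM = 400 and the fanin_init below are simplified stand-ins for my actual definitions):

import math

HIDDEN_DIM = 400  # placeholder value

def fanin_init(size):
    # Simplified stand-in for my real fan-in initializer.
    v = 1.0 / math.sqrt(size[0])
    return torch.empty(*size).uniform_(-v, v)

s_dim, a_dim, n_agents, batch = 8, 2, 3, 16
actor = Actor(s_dim, a_dim, n_agents)

obs = torch.rand(batch, n_agents, s_dim)  # (batch, num_agents, s_dim)
actions = actor(obs)                      # (batch, num_agents, a_dim)
print(actions.shape)                      # torch.Size([16, 3, 2])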

@sven1977 are you able to answer this?


I also wondered whether it would be smarter to just use the single-agent A2C algorithm with the custom models shown above. That way, communication through the model’s B-RNN layer should be ensured, shouldn’t it? Or is there a problem with the B-RNN layer in RLlib because of its distributed processing? :thinking:

@CodingBurmer, just asking this out of curiosity to learn more about the B-RNN layer and the associated infrastructure:
Is it compute-intensive and does it need GPUs to train faster? And are you therefore doing multi-node, multi-GPU distributed training with Ray, or is it a single node with/without GPUs?


Hey @CodingBurmer, it depends on your “multiagent” setup.

  1. If you have 2 agents that map to the same policy, then you definitely have parameter sharing :slight_smile: (example: rllib/examples/multi_agent_parameter_sharing.py; see also the config sketch below this list).
  2. If you have 2 agents that use 2 different policies, you could have:
    a) a “pseudo”-centralized critic, where the central value function used in the loss is not(!) shared between the policies (it’s learned separately), but uses both agents’ observations as input.
    Example: rllib/examples/centralized_critic.py
    b) one truly shared layer, where the 2 policies use different models, but these 2 models share one particular layer (e.g. via a global variable).
    Example: rllib/examples/multi_agent_cartpole.py
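
For case 1., the config looks roughly like this (a minimal sketch, assuming a Ray 1.x style config dict; the env name and spaces are placeholders for your own two-agent MultiAgentEnv):

import gym
from ray import tune

obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
act_space = gym.spaces.Discrete(2)

config = {
    "env": "my_two_agent_env",  # placeholder: your registered MultiAgentEnv
    "multiagent": {
        # A single policy entry -> a single set of weights.
        "policies": {"shared_policy": (None, obs_space, act_space, {})},
        # Every agent ID maps to that one policy, so all agents share its parameters.
        "policy_mapping_fn": lambda agent_id: "shared_policy",
    },
}

tune.run("PPO", config=config, stop={"training_iteration": 1})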

Also note: We currently do not support differentiable multi-agent communication channels, which I believe is what your B-RNN is supposed to provide out of the box. We wanted to add that capability to the trajectory view API but have currently deprioritized this project due to lack of resources.
It would be super cool, though, if you could open a PR with a short example of your BicNet MA-comm channel setup. I’m not sure whether it would be trivial to implement, as our rollout workers treat every RNN they encounter as a time-axis RNN rather than an “agent-axis” RNN.


Hey @gaurav, yeah, I would definitely suggest using a GPU for training any LSTM-like net. If you use PPO, IMPALA, or A2C, you can even use multi-GPU with tf (and soon with torch as well once I fix a bug there). I would start with n workers (num_workers) to scale up sample collection (non-GPU) and use 1 GPU (num_gpus=1) for the trainer on the local worker (driver). Then you can play around with the ratio between the two (add more workers or more GPUs).
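
In config terms, that looks roughly like this (a sketch, assuming a Ray 1.x style config dict; tune the numbers to your hardware):

config = {
    "num_workers": 8,   # CPU rollout workers for parallel sample collection
    "num_gpus": 1,      # GPU(s) for the trainer on the local worker (driver)
    "framework": "tf",  # multi-GPU currently works with tf; torch fix pending
}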


@sven1977, I had a couple of questions.

My first one is regarding your reply

Also note: We currently do not support differentiable multi-agent communication channels,

Did you mean that ^ for communication between agents of different policies?

Curious because I got this from the docs:

When replay_mode=lockstep, RLlib will replay all the agent transitions at a particular timestep together in a batch. This allows the policy to implement differentiable shared computations between agents it controls at that timestep.

The replay_mode=lockstep capability was added in this PR, and @ericl mentions in this comment that some changes make it possible to support arbitrary differentiable communication between agents controlled by the same policy.

My second question is also about lockstep mode:

Some things like prioritized replay and SGD minibatches are difficult to implement in lockstep mode, so those might not be supported right away.

Since PPO uses SGD minibatches, does that mean one is not supposed to use PPO on a MultiAgentEnv with lockstep mode enabled? (I tried it, and there was no compilation error and no exception raised.) I’ve probably misunderstood this, so could you shed some more light on what exactly can and cannot be done in lockstep mode?
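
For reference, this is roughly what I tried (a sketch; the spaces and policy name are placeholders):

import gym

obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
act_space = gym.spaces.Discrete(2)

ppo_config = {
    "multiagent": {
        "policies": {"shared_policy": (None, obs_space, act_space, {})},
        "policy_mapping_fn": lambda agent_id: "shared_policy",
        # Replay all agents' transitions of a timestep together in one batch.
        "replay_mode": "lockstep",
    },
    # PPO still does SGD minibatching on top of this, which is what my question is about.
    "sgd_minibatch_size": 128,
}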


Hi @sven1977 ,

I want to ask something more about parameter sharing. As I understand it, the RLlib implementation of parameter sharing corresponds to centralised learning, since all the agents use the same policy, but is the execution decentralised?
If yes (for example, using parameter sharing with PPO), how does the decentralised execution happen in the actor?

Thanks,
George
