Decentralized multi-agent reinforcement learning

Hi everyone,

My previous discussion has been deleted, and I really would like to know the reason…
Nevertheless, I’d greatly appreciate any suggestions you could offer.
I need to develop a multi-agent reinforcement learning system in which each agent acts independently of the others, and actions are taken at different moments, so the decision times are not aligned across agents.
Is there a way to handle this type of problem in Ray RLlib, and if so, could you please explain how?

Thank you and have a nice day.
L.E.O.

Hey @LeoLeoLeo,

If I’m understanding correctly, you’d like agents with different policies to independently learn which actions to take, correct?

Ray RLlib supports this by default: each agent can independently learn its own policy without needing centralized training or synchronized actions in a multi-agent setting. Each agent will act based on its own learned policy, allowing them to operate asynchronously. I hope this helps!
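
For example, in a MultiAgentEnv the return of a single step() only needs to include the agents that have to act next. Here is a minimal, hypothetical sketch (agent IDs and values are made up, using the older dict-return API):

# Hypothetical return of one env.step() call at a moment when only
# "agent_1" has to decide; "agent_2" is simply absent from the dicts and
# will not be asked for an action until it appears in a later obs dict.
obs = {"agent_1": [0.3, 0.7]}   # observation only for the agent that must act next
rewards = {"agent_1": 0.0}      # rewards may be reported for any subset of agents
dones = {"__all__": False}      # "__all__" ends the episode for all agents
infos = {}
# step() returns: obs, rewards, dones, infos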

Tyler

Hi Tyler,

You understood correctly, and I appreciate your response. However, as a beginner with RLlib, I’m still having trouble understanding how to handle asynchrony between agents.
Specifically, Agent 1 becomes available at time t=1, but Agent 2 isn’t yet free to make a decision. Then, at t=2, Agent 2 becomes available while Agent 1 is still occupied.

I’ve previously worked with synchronous multi-agent setups, where both agents make decisions at the same time. However, I’m unclear about what the state dictionary should contain in an asynchronous setup. Does each agent need to return its state at time t, even if it hasn’t acted? And what about rewards?

Here’s my config dictionary in case it might help:

config = {
    "multiagent": {
        "policies": {
            "agent_1_policy": (
                None,
                cd.observation_space['agent_1'],
                cd.action_space['agent_1'],
                {"model": {"fcnet_hiddens": [16, 16]}},
            ),
            "agent_2_policy": (
                None,
                cd.observation_space['agent_2'],
                cd.action_space['agent_2'],
                {"model": {"fcnet_hiddens": [16, 16]}},
            ),
        },
        "policy_mapping_fn": policy_mapping_fn,
    },
    "env": ABiCi_env,
    "env_config": env_config,
    "exploration_config": {
        "type": "StochasticSampling"
    },
    "lr": learning_rate,
    "gamma": discount_factor
}

And here’s my policy_mapping_fn:

def policy_mapping_fn(agent_id, episode, worker, **kwargs):
    if agent_id == 'agent_1':
        return 'agent_1_policy'
    elif agent_id == 'agent_2':
        return 'agent_2_policy'
    else:
        raise ValueError(f"Invalid agent ID: {agent_id}")
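
In case it’s relevant, I launch training with something along these lines (a simplified sketch; I’m showing PPO here just as an example algorithm, and the stopping criterion is a placeholder):

import ray
from ray import tune

ray.init()
tune.run(
    "PPO",                                 # example algorithm; the specific choice isn't the point
    config=config,
    stop={"training_iteration": 100},      # placeholder stopping criterion
)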

Thank you again!

L.E.O.

Hi @LeoLeoLeo,

Although it is not well documented and the relevant pieces are spread out across the code, RLlib’s multi-agent model is conceptually straightforward.

  1. On every step t at which an action is required from an agent, the environment places that agent’s observation in the observation dictionary under the key for its agent ID.

  2. That agent must then provide an action on the next transition update, i.e. the next call to step.

  3. The environment must return a reward for that agent but, and this is crucial to your question, it is not required to return a new observation for that agent until it needs another action from that agent at some future time, t_f.

  4. The environment may return additional rewards for that agent between t and t_f. Those rewards accumulate until either t_f or until the environment is marked as done or terminated.

  5. Once an agent is marked as done or terminated, it must not appear in the observation dictionary again until the environment is reset. I do not know whether an agent can continue to accumulate rewards after it is done or terminated; that would need to be tested.
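
To make this concrete, here is a minimal, hypothetical two-agent environment (not your ABiCi_env; agent IDs, observations, rewards, and episode length are placeholders) written against the older dict-based MultiAgentEnv API that your config suggests. It only puts an agent into the observation dict when that agent has to act, and it keeps reporting rewards for the waiting agent, which RLlib accumulates until that agent’s next observation:

from ray.rllib.env.multi_agent_env import MultiAgentEnv

class AsyncToyEnv(MultiAgentEnv):
    """Toy sketch: agents take turns, so their decision times never align."""

    def __init__(self, config=None):
        self.t = 0
        self._in_episode = set()  # agents that have received at least one obs

    def reset(self):
        self.t = 0
        self._in_episode = {"agent_1"}
        # Only agent_1 has to act at the start; agent_2 is not queried yet.
        return {"agent_1": [0.0]}

    def step(self, action_dict):
        # action_dict only contains actions for the agents that appeared in
        # the previous obs dict (points 1-2 above).
        self.t += 1
        # Rewards may be reported for any agent already in the episode, even
        # one that is not acting now; RLlib accumulates them until that
        # agent's next observation (points 3-4 above).
        rewards = {aid: 0.1 for aid in self._in_episode}
        # Alternate which agent has to act; the other agent gets no new
        # observation until it is its turn again (point 3 above).
        acting = "agent_1" if self.t % 2 == 0 else "agent_2"
        obs = {acting: [float(self.t)]}
        self._in_episode.add(acting)
        dones = {"__all__": self.t >= 10}  # "__all__" ends the episode for everyone
        return obs, rewards, dones, {}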

Hi @mannyv,
thank you so much for your response. You addressed all my questions one by one, which I really appreciate. You managed to understand my concerns, which weren’t easy to explain. I ran a few tests, and it seems to be working.

Thanks again!

L.E.O.