Multi-agent setting different step sizes for agents and how actions are passed?

hridayns · April 24, 2022, 6:11pm

Hello,

I am confused, at a conceptual level, about how to model a multi-agent reinforcement learning problem in which the agents' actions take different amounts of time to complete. A single action is then performed over multiple steps, and every learning sample collected during those steps would have that same action attached to it (possibly with different observations and rewards).

An example of this situation would be where vehicles on a 2-lane road can perform lane changing actions, but each of these actions may take anywhere between 2 - 5 seconds (or learning steps) to complete. So what action would need to be passed at every step for RLlib’s PPO algorithm? Is it even possible to do this? Or do all these agents have to have the same action duration / step length for any RL algorithm to work?

I would greatly appreciate it if anyone could point me in the right direction on getting past this mental block. It is driving me crazy. Thank you.

sven1977 · April 26, 2022, 9:15am

Hi @hridayns , thanks for posting this interesting question!

If I understand this correctly, this (different actions taking a different number of timesteps) would be something that your environment needs to handle/simulate. RLlib’s multi-agent env API allows you to do that as follows:

Assume you have 2 agents (cars) acting in a traffic simulator. At each step(), your environment should return observations only for those agents from which it expects an action next.

For example, see the following two “timelines”:

ts:   0       1       2       3       4
car1: a1 .... a2 .... a3 ---- ---- -- a4
car2: b1 ---- ---- -- b2 .... b3 .... b4

The two cars pick actions (car1 a-actions, car2 b-actions) at different timesteps (0 to 4).
Some of these actions take longer than one ts, e.g. b1 takes 2 timesteps.
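
The timelines can be reproduced with a few lines of plain Python. This is only a sketch: the `durations` table is read off the diagram (a3 and b1 are assumed to occupy two timesteps, all other actions one), and `decision_schedule` is a hypothetical helper, not an RLlib function.

```python
from typing import Dict, List

# Duration (in timesteps) of each successive action, read off the timelines.
durations: Dict[str, List[int]] = {
    "car1": [1, 1, 2],  # a1, a2, a3
    "car2": [2, 1, 1],  # b1, b2, b3
}


def decision_schedule(durations: Dict[str, List[int]]) -> Dict[int, List[str]]:
    """Map each timestep to the agents that must pick a new action there."""
    schedule: Dict[int, List[str]] = {}
    for agent, steps in durations.items():
        t = 0
        for d in steps:
            schedule.setdefault(t, []).append(agent)
            t += d  # the agent is busy until its action completes
        # The agent decides again once its last listed action has finished.
        schedule.setdefault(t, []).append(agent)
    return dict(sorted(schedule.items()))


for ts, agents in decision_schedule(durations).items():
    print(ts, agents)
```

The agents listed for a timestep are the ones whose observations the env would need to return there.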

Your env now has to do the following:

Upon reset (ts=0), both cars’ initial observations need to be present in the observation dict, such that RLlib knows to produce actions for both cars.
RLlib produces a1 and b1 and sends them back into the env’s step() method, which now must return only car1’s observation in the obs dict (the env expects only car1 to act next, not car2, because car2’s action takes longer to complete).
RLlib computes a2 for car1 and sends it back to the env via its step() method.
The env should now add both cars’ observations to the returned obs dict, because it expects both cars to compute another action.
RLlib computes actions a3 and b2 for the two cars …
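
The steps above can be sketched as a toy environment. This is a minimal, framework-free illustration, not RLlib code: in RLlib the class would subclass `ray.rllib.env.multi_agent_env.MultiAgentEnv`, and the class name `LaneChangeEnv`, the action encoding, the fixed `lane_change_steps` duration, the placeholder rewards, and the four-value `step()` return are all assumptions made for the example.

```python
from typing import Dict, Tuple

KEEP_LANE, CHANGE_LANE = 0, 1  # hypothetical action encoding


class LaneChangeEnv:
    """Toy two-car env using the dict-in/dict-out multi-agent convention."""

    def __init__(self, lane_change_steps: int = 2, horizon: int = 100):
        self.agents = ["car1", "car2"]
        self.lane_change_steps = lane_change_steps  # assumed fixed duration
        self.horizon = horizon
        self.reset()

    def reset(self) -> Dict[str, Tuple[int, int]]:
        self.t = 0
        self.lane = {a: 0 for a in self.agents}
        # Timestep at which each car finishes its current action.
        self.busy_until = {a: 0 for a in self.agents}
        # Both cars must act at ts=0, so both observations are returned.
        return {a: self._obs(a) for a in self.agents}

    def _obs(self, agent: str) -> Tuple[int, int]:
        return (self.t, self.lane[agent])

    def step(self, action_dict: Dict[str, int]):
        # Only the cars that were asked to act appear in action_dict.
        for agent, action in action_dict.items():
            if action == CHANGE_LANE:
                # Toy model: the lane index flips when the maneuver starts.
                self.lane[agent] = 1 - self.lane[agent]
                duration = self.lane_change_steps
            else:
                duration = 1
            self.busy_until[agent] = self.t + duration
        # Jump to the next timestep at which some car has finished its action.
        self.t = min(self.busy_until.values())
        ready = [a for a in self.agents if self.busy_until[a] == self.t]
        obs = {a: self._obs(a) for a in ready}  # only cars expected to act
        rewards = {a: 0.0 for a in ready}       # placeholder reward
        dones = {"__all__": self.t >= self.horizon}
        return obs, rewards, dones, {}


env = LaneChangeEnv()
obs = env.reset()
print(sorted(obs))  # both cars act at ts=0
obs, *_ = env.step({"car1": KEEP_LANE, "car2": CHANGE_LANE})
print(sorted(obs))  # car2 is still changing lanes
obs, *_ = env.step({"car1": KEEP_LANE})
print(sorted(obs))  # both cars are ready again
```

One design choice here is that the env jumps straight to the next timestep at which some car is ready, so it never has to return an empty observation dict when every car is busy.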

Hope this makes sense.

sven1977 · April 26, 2022, 9:18am

Here is the page in our documentation that explains this a bit further. There is an example for turn-based games, which is quite similar to your setup (actually, your setup is even more complex/chaotic :), but should work either way).
