Conceptually, I am confused about how to model a multi-agent reinforcement learning problem in which each agent’s actions take different amounts of time to complete. A given action is then carried out over multiple steps, and every learning sample in that span would have that same action attached to it (possibly with different observations and rewards).
An example would be vehicles on a 2-lane road that can perform lane-changing actions, where each of these actions may take anywhere from 2 to 5 seconds (or learning steps) to complete. So what action would need to be passed at every step for RLlib’s PPO algorithm? Is it even possible to do this? Or do all these agents have to have the same action duration / step length for any RL algorithm to work?
I would greatly appreciate it if anyone could point me in the right direction for getting past this mental block; it is driving me crazy. Thank you.
Hi @hridayns, thanks for posting this interesting question!
If I understand this correctly, this (different actions taking a different number of timesteps) would be something that your environment needs to handle/simulate. RLlib’s multi-agent env API allows you to do that as follows:
Assume you have 2 agents (cars) that act in a traffic simulator. At each step(), your environment should return observations in the observation dict only for those agents from which it expects actions next.
For example, see the following two “timelines”:
ts: 0 1 2 3 4
car1: a1 .... a2 .... a3 ---- ---- ---- a4
car2: b1 ---- ---- b2 .... b3 .... b4
The two cars pick actions (car1 a-actions, car2 b-actions) at different timesteps (0 to 4).
Some of these actions take longer than one ts, e.g. b1 takes 2 timesteps.
Your env now has to do the following:
- Upon reset (ts=0), both cars’ initial observations need to be present in the observation dict, such that RLlib knows to produce actions for both cars.
- RLlib produces a1 and b1 and sends them back into the env’s step() method, which now must return only car1’s observation in the obs dict (the env only expects car1 to act next, not car2, as car2’s action takes a little longer).
- RLlib computes a2 for car1 and sends it back to the env via its step() method.
- The env should now add both cars’ observations to the returned obs dict, because it expects both cars to compute another action.
- RLlib computes actions a3 and b2 for the two cars …
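To make the steps above a bit more concrete, here is a minimal sketch in plain Python. It is not real RLlib code: in practice the class would subclass RLlib’s MultiAgentEnv and define observation and action spaces. The names LaneChangeEnv, KEEP, CHANGE, change_duration and busy_until are made up for illustration, and the (obs, rewards, dones, infos) return shape with an "__all__" key is an assumption that follows the older multi-agent API; newer versions may expect a different tuple. The only point is that step() returns observations solely for agents that are free to act.

```python
KEEP, CHANGE = 0, 1  # hypothetical discrete actions: keep lane / change lane


class LaneChangeEnv:
    """Toy stand-in for a multi-agent env with variable action durations."""

    def __init__(self, change_duration, horizon=10):
        # change_duration: agent_id -> number of timesteps a lane change takes
        self.change_duration = change_duration
        self.horizon = horizon
        self.reset()

    def reset(self):
        self.t = 0
        self.lane = {a: 0 for a in self.change_duration}
        # Timestep at which each agent is free to pick its next action.
        self.busy_until = {a: 0 for a in self.change_duration}
        self.changing = set()  # agents with a lane change in progress
        return self._ready_obs()

    def _ready_obs(self):
        # Only agents that are not mid-action get an observation.
        return {a: self.lane[a] for a in self.lane if self.busy_until[a] <= self.t}

    def _tick(self):
        # Advance the simulation by one timestep and finish completed lane changes.
        self.t += 1
        for agent in list(self.changing):
            if self.busy_until[agent] <= self.t:
                self.lane[agent] = 1 - self.lane[agent]
                self.changing.remove(agent)

    def step(self, action_dict):
        for agent, action in action_dict.items():
            if self.busy_until[agent] > self.t:
                raise ValueError(f"{agent} is still busy at ts={self.t}")
            if action == CHANGE:
                self.busy_until[agent] = self.t + self.change_duration[agent]
                self.changing.add(agent)
            else:
                self.busy_until[agent] = self.t + 1
        self._tick()
        # If every agent is still busy, keep simulating until one is ready.
        while not self._ready_obs() and self.t < self.horizon:
            self._tick()
        obs = self._ready_obs()
        rewards = {a: 0.0 for a in obs}  # placeholder rewards
        dones = {"__all__": self.t >= self.horizon}
        return obs, rewards, dones, {}


# Replays the first steps of the list above: car2's first action takes 2 timesteps.
env = LaneChangeEnv({"car1": 2, "car2": 2})
obs = env.reset()
trace = [sorted(obs)]                                    # both cars act at ts=0
obs, _, _, _ = env.step({"car1": KEEP, "car2": CHANGE})
trace.append(sorted(obs))                                # only car1 acts at ts=1
obs, _, _, _ = env.step({"car1": KEEP})
trace.append(sorted(obs))                                # both cars act at ts=2
print(trace)
```

In this sketch the reward is a placeholder; a real env would probably want to accumulate the reward collected while an action is in progress and hand it out together with that agent’s next observation.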
Hope this makes sense.
Here is the page in our documentation that explains this a bit further. There is an example for turn-based games, which is quite similar to your setup (actually, your setup is even more complex/chaotic :), but should work either way).