Multi-Agent Policy Switching

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

I’m curious… I’m currently brainstorming how I would set up a multi-agent training process in which agents are directed to switch policies mid-episode. I’m wondering how feasible this is and how folks would recommend going about it. Here are some more details:

  • This will be “hierarchical” in nature - there will be a “Boss” that has 4 different objects to direct. Each object can be assigned to perform 1 of 3 different tasks at any given time.
  • I imagine there would be a unique policy for each of the 3 tasks, so Object 1 might start out assigned to “Task X” and then be directed to switch to “Task Y” mid-episode. That means I’d need to be able to switch which policy is acting and being trained for that object, from Policy X to Policy Y.
  • Not all the objects have to be working on the same task at the same time, but they can be.
  • Not sure whether it’s relevant, but I’ve successfully used PPO to train each of the tasks individually. At this point, the challenge is simultaneously training a leader that decides resource allocation and the individual task policies that the objects act with.
  • I have considered a single policy with different reward functions based on the assigned task, but that feels messy at best.
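
To make the setup above concrete, here is a minimal sketch of one possible approach: encode each object's currently assigned task in the agent ID the environment emits, so that a mapping function can route it to the matching task policy. This is plain Python and is not wired into Ray. The function is only shaped like an RLlib `policy_mapping_fn`, and the task names, the `boss` agent ID, the `obj{i}__{task}` ID scheme, and the policy names are all hypothetical, chosen for illustration.

```python
from typing import Dict, Optional

TASKS = ("task_x", "task_y", "task_z")  # hypothetical task names
NUM_OBJECTS = 4
BOSS_ID = "boss"  # hypothetical agent ID for the leader


def make_agent_id(obj_index: int, task: str) -> str:
    """Encode the currently assigned task in the agent ID the env emits."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    return f"obj{obj_index}__{task}"


def policy_mapping_fn(agent_id: str, episode: Optional[object] = None, **kwargs) -> str:
    """Route the boss to its own policy and each object to its task's policy."""
    if agent_id == BOSS_ID:
        return "boss_policy"
    _, task = agent_id.split("__", 1)
    return f"{task}_policy"


def active_agents(assignment: Dict[int, str]) -> Dict[str, str]:
    """Map every currently active agent ID to the policy it would use."""
    ids = [BOSS_ID] + [make_agent_id(i, t) for i, t in sorted(assignment.items())]
    return {aid: policy_mapping_fn(aid) for aid in ids}


# Start of episode: the boss puts every object on task_x.
assignment = {i: "task_x" for i in range(NUM_OBJECTS)}
before = active_agents(assignment)

# Mid-episode: the boss reassigns object 1 to task_y.
assignment[1] = "task_y"
after = active_agents(assignment)

print(before["obj1__task_x"])  # task_x_policy
print(after["obj1__task_y"])   # task_y_policy
```

Under this scheme a mid-episode switch is not a change to the mapping itself. The environment would stop emitting `obj1__task_x` and start emitting `obj1__task_y`, so each task policy only ever sees experience collected while its task was assigned. How the old agent ID's trajectory should be closed out when the switch happens is left open here.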

I have seen a few examples of setting up this sort of hierarchical learning, but I have not seen anyone implement policy switching mid-episode. Here’s a link with such an example: Note that in that example the high-level decision is made only once, at the start of the episode, rather than continuously throughout the episode.

Can anyone share their thoughts on this?