How severe does this issue affect your experience of using Ray?
- None: Just asking a question out of curiosity
I’m curious… I’m currently diagramming/brainstorming how I would set up a multi-agent training process in which agents are directed to switch policies mid-episode. I am wondering how feasible this is and how folks would recommend going about it. Here are some more details:
- This will be “hierarchical” in nature - there will be a “Boss” that has 4 different objects to direct. Each object can be assigned to perform 1 of 3 different tasks at any given time.
- I imagine there would be a unique policy for each of the 3 tasks, meaning that Object 1 might start out being tasked with “Task X” and then be directed to switch to “Task Y” mid-episode — which means I’d need to be able to switch that object’s training from Policy X to Policy Y.
- Not all the objects have to be working on the same task at the same time, but they can be.
- Not sure whether it’s relevant, but I’ve used PPO to train the tasks individually with success. At this point, the challenge is simultaneously training a leader to determine resource allocation and the individual policies that the objects act with.
- I have considered a single policy with different reward functions based on the assigned task, but that feels messy at best.
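To make the idea concrete, here’s a rough sketch of one approach I’ve been considering: encode the currently assigned task into the agent ID (e.g. `"obj1_taskX"`), so that a `policy_mapping_fn` routes each observation to the right task policy. The function name follows RLlib’s multi-agent API, but everything else (the ID scheme, policy names) is hypothetical, and I haven’t tested this — my understanding is that RLlib caches the agent-to-policy mapping within an episode, so changing the agent ID itself may be what forces a re-map; please correct me if that’s wrong.

```python
# Sketch only (untested, names hypothetical): the env emits agent IDs like
# "obj1_taskX", and the mapping function routes each to a per-task policy.

def policy_mapping_fn(agent_id, *args, **kwargs):
    """Map e.g. 'obj1_taskX' -> 'policy_taskX'.

    If the env renames an agent from 'obj1_taskX' to 'obj1_taskY' when the
    Boss reassigns it, its experience starts flowing to Policy Y instead.
    """
    if agent_id == "boss":
        return "boss_policy"
    task = agent_id.split("_")[-1]  # "taskX", "taskY", or "taskZ"
    return f"policy_{task}"

# The Boss reassigns Object 1 from Task X to Task Y mid-episode:
print(policy_mapping_fn("obj1_taskX"))  # policy_taskX
print(policy_mapping_fn("obj1_taskY"))  # policy_taskY
print(policy_mapping_fn("boss"))        # boss_policy
```

The obvious downside is that “agents” appear and disappear mid-episode from RLlib’s point of view, and I’m not sure how that interacts with PPO’s trajectory handling.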
I have seen a few examples of setting up this sort of hierarchical learning, but I’ve not seen anyone switch policies mid-episode. Here’s a link to such an example: https://github.com/DeUmbraTX/practical_rllib_tutorial/blob/main/your_rllib_environment.py Note that the high-level decision is made only once, at the start of the episode, not continuously throughout the episode.
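By contrast, what I’m picturing is the high-level decision recurring throughout the episode — e.g. the Boss acting every K steps to reassign tasks, with the low-level policies acting every step in between. A toy sketch of just that control flow (plain Python, no RLlib; every name here is made up for illustration):

```python
# Toy control-flow sketch (all names hypothetical): the Boss re-assigns one of
# 3 tasks to each of the 4 objects every BOSS_PERIOD steps; between those
# decisions only the low-level task policies act.

BOSS_PERIOD = 10
TASKS = ["taskX", "taskY", "taskZ"]

def run_episode(num_steps, boss_act, worker_act):
    assignments = {f"obj{i}": "taskX" for i in range(1, 5)}
    log = []
    for t in range(num_steps):
        if t % BOSS_PERIOD == 0:
            # High-level action: a dict mapping each object to a task.
            assignments = boss_act(t, assignments)
        for obj, task in assignments.items():
            worker_act(obj, task)  # low-level step under the assigned task policy
        log.append(dict(assignments))
    return log

# Example boss that cycles Object 1 through the tasks each period:
def toy_boss(t, assignments):
    new = dict(assignments)
    new["obj1"] = TASKS[(t // BOSS_PERIOD) % len(TASKS)]
    return new

log = run_episode(25, toy_boss, lambda obj, task: None)
print(log[0]["obj1"], log[10]["obj1"], log[20]["obj1"])  # taskX taskY taskZ
```

The open question for me is how to express this two-timescale loop inside a single `MultiAgentEnv.step()` so that both the Boss and the task policies get credit-assigned rewards.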
Can anyone share their thoughts on this?