Maybe I can partly answer this! Short answer: No, you can’t directly just pass two algorithms to Tune. If you have two agents and need to train them with different algorithms, then how to best do it depends on whether you need the two agents to learn from the same batch of experiences (my original question), or if you’re happy for them to generate experiences separately.
Note that there are two different “two-trainer” examples:
The first one does separate experiences, the second one does one shared environment and shared experiences.
Separate experiences: This is easier. You train both algorithms completely separately. Each Algorithm holds two Policies but only trains one of them. E.g., the DQN Algorithm has both a DQNPolicy and a PPOPolicy, but it uses the PPOPolicy "read-only", just to generate experiences. It uses both policies to collect a sample batch, then trains only the DQNPolicy. You then sync the DQNPolicy weights over to the PPO Algorithm's copy of the DQNPolicy; PPO uses both of its policies to collect a sample batch, trains its PPOPolicy, and you sync those weights back to the DQN Algorithm's PPOPolicy. Rinse and repeat. If you want to run this under Tune, you could just wrap the workflow from that example into a function trainable.
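To make the alternating train-and-sync pattern concrete, here is a schematic sketch. The `ToyTrainer` class is a stand-in I made up so the snippet runs without RLlib; its `get_weights(policy_ids)` / `set_weights(weights)` methods mirror the real Trainer API, and the loop body is the workflow described above:

```python
class ToyTrainer:
    """Toy stand-in for an RLlib Trainer: holds weights for BOTH
    policies, but train() only ever updates `trained_policy`."""

    def __init__(self, trained_policy):
        self.trained_policy = trained_policy
        self.weights = {"dqn_policy": 0.0, "ppo_policy": 0.0}

    def train(self):
        # Stand-in for one training iteration: only the trained
        # policy's weights change; the other policy is read-only
        # and is just used to generate experiences.
        self.weights[self.trained_policy] += 1.0

    def get_weights(self, policy_ids):
        return {pid: self.weights[pid] for pid in policy_ids}

    def set_weights(self, weights):
        self.weights.update(weights)


dqn = ToyTrainer("dqn_policy")
ppo = ToyTrainer("ppo_policy")

for _ in range(3):
    # 1) DQN samples with both policies, trains only its DQNPolicy.
    dqn.train()
    # 2) Sync the freshly trained DQN weights into PPO's read-only copy.
    ppo.set_weights(dqn.get_weights(["dqn_policy"]))
    # 3) PPO samples with both policies, trains only its PPOPolicy.
    ppo.train()
    # 4) Sync the trained PPO weights back into DQN's read-only copy.
    dqn.set_weights(ppo.get_weights(["ppo_policy"]))

# After each full round, both trainers hold identical weights.
print(dqn.weights == ppo.weights)  # True
```

Wrapping that loop in a function and passing it to `tune.run` is what "function trainable" means here: Tune just calls your function and you report metrics from inside it.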
Shared env and experiences: That's the example you linked to. This is a lot more difficult: essentially you have to write your own Algorithm/Trainer for the specific combination of algorithms you want to use. I would avoid this unless you absolutely need both agents acting in the same environment. For any combination other than PPO and DQN you'd basically have to start from scratch, and even for PPO+DQN I think the example may be missing a few details.
What's currently simply not possible in RLlib is to plug-and-play different Algorithms together.
Does your setting hinge on one agent observing how the other agent's behavior changes as that agent learns? In many cases you can get away with the separate-experiences workflow, and it is much easier to do currently. I'd think very carefully about whether you really need the same-env approach.
If you do, another approach that I discussed with Sven at one point would be to have one of the two Algorithms use an offline input reader. It would work something like the separate-experiences workflow: you have the DQN algorithm generate experiences, then grab them somehow and feed those same experiences into the PPO algorithm. That may or may not be easier than a custom Algorithm.
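In config terms, the "grab them somehow" part could use RLlib's offline data settings: one trainer writes its sample batches to disk via `output`, the other reads them via `input`. Hedging here: I haven't actually tried this combination, and the `/tmp/dqn-out` path is just illustrative:

```python
# DQN writes every sample batch it collects out to disk as JSON.
dqn_config = {
    # ... the rest of your DQN config ...
    "output": "/tmp/dqn-out",  # directory to write experiences to
}

# PPO then trains from those same batches instead of sampling itself.
ppo_config = {
    # ... the rest of your PPO config ...
    "input": "/tmp/dqn-out",  # read DQN's experiences back in
}
```

Note that PPO is an on-policy algorithm, so training it purely from another algorithm's batches is off-policy and may behave badly; that caveat applies to the custom-Algorithm route too.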
Ah, are you looking at something like "How did the specific action sampled by one agent influence the learning of the other agent?" (as opposed to the agent's policy / the expectation over its actions)? Then yes, that would be another scenario where separate experiences might not work.