I have a question about how to implement policy distillation in RLlib. Similar to the scheme presented in Divide-and-Conquer RL, I have multiple agents, each learning a distinct partition of the context space in a contextual MDP (or a distinct task, in the multi-task RL setting). I want to distill these local learners into a central policy while the local learners are still training.
Note that the central policy never interacts with the environment; it is updated only by an imitation-learning pass over the samples collected by the local learners.
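To make the setup concrete, here is a minimal sketch of the distillation step I have in mind, written in plain PyTorch rather than against RLlib's internals (the policy networks, dimensions, and `distill_step` helper are all hypothetical stand-ins): each teacher contributes a batch of observations from its partition, and the central policy minimizes the KL divergence from each teacher's action distribution on those observations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions for illustration only.
OBS_DIM, N_ACTIONS, N_TEACHERS = 4, 3, 2

def make_policy():
    # Stand-in for whatever policy network the local learners use.
    return nn.Sequential(nn.Linear(OBS_DIM, 32), nn.Tanh(),
                         nn.Linear(32, N_ACTIONS))

teachers = [make_policy() for _ in range(N_TEACHERS)]  # local learners
student = make_policy()                                # central policy
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def distill_step(batches):
    """One distillation update.

    batches: list of (teacher_idx, obs_batch) pairs, one per partition,
    where obs_batch holds observations sampled by that teacher.
    """
    opt.zero_grad()
    loss = torch.zeros(())
    for t_idx, obs in batches:
        with torch.no_grad():
            # Teacher distributions are targets; no gradient through them.
            teacher_logp = F.log_softmax(teachers[t_idx](obs), dim=-1)
        student_logp = F.log_softmax(student(obs), dim=-1)
        # KL(teacher || student), averaged over the batch.
        loss = loss + F.kl_div(student_logp, teacher_logp,
                               log_target=True, reduction="batchmean")
    loss.backward()
    opt.step()
    return loss.item()

# Fake samples standing in for each teacher's rollout data.
batches = [(i, torch.randn(16, OBS_DIM)) for i in range(N_TEACHERS)]
first_loss = distill_step(batches)
for _ in range(50):
    last_loss = distill_step(batches)
```

In RLlib terms, the part I am unsure about is where this update would live: the teachers' sample batches already flow through their own training loops, so the question is how to route copies of those batches to an extra learner that holds only the central policy.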
I have seen a similar discussion here, but that approach only works with a single teacher policy.
Any suggestions would help.