Hi, I’ve implemented a multi-agent version of Connect 4 and I’m trying to train it with PPO through self-play.
At each turn the environment returns the observation and reward for the player that will move next.
The observation consists of two parts:
The board configuration from the current player’s point of view (for example, if player 1 sees the bottom row as [0,0,1,0,2,0,0], player 2 will see it as [0,0,2,0,1,0,0]; I did this so that a single policy can be used for self-play, see the sketch after this list).
An action mask, which I use in my custom model to filter out invalid actions.
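For clarity, the perspective flip is essentially this (a simplified sketch rather than my exact code; the board stores 1 for player 1’s pieces and 2 for player 2’s):

```python
import numpy as np

def observation_for(board: np.ndarray, player: int) -> np.ndarray:
    """Return the board as seen by `player`, so that each agent
    always sees its own pieces as 1 and the opponent's as 2."""
    if player == 1:
        return board.copy()
    flipped = board.copy()
    flipped[board == 1] = 2  # player 1's pieces become the opponent's
    flipped[board == 2] = 1  # player 2's pieces become "mine"
    return flipped
```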
After the final winning move, the environment returns observations and rewards for both players: reward +1 for the winner and -1 for the loser. I’ve also randomized which player moves first (e.g. player 1 can start as the second player to move), so that player 1 sees all possible board configurations.
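So the very last step of an episode looks roughly like this (a simplified sketch in the style of a dict-based multi-agent env; the agent ids are placeholders for my own ones):

```python
# Terminal step after player 1 drops the winning piece (simplified).
obs = {
    "player_1": observation_for(board, 1),
    "player_2": observation_for(board, 2),
}
rewards = {"player_1": 1.0, "player_2": -1.0}
dones = {"player_1": True, "player_2": True, "__all__": True}
```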
The only policy that is being trained is the player 1 policy (see the config sketch below).
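The multi-agent setup is roughly this (a simplified, RLlib-style sketch; `obs_space`, `act_space` and `select_policy` stand in for my actual spaces and mapping function, and the policy ids are placeholders):

```python
config = {
    "multiagent": {
        "policies": {
            # the learning policy
            "main": (None, obs_space, act_space, {}),
            # five frozen opponents holding past snapshots of "main"
            **{f"opponent_{i}": (None, obs_space, act_space, {})
               for i in range(1, 6)},
        },
        "policy_mapping_fn": select_policy,  # agent id -> policy id
        "policies_to_train": ["main"],       # only "main" is optimized
    },
}
```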
I’ve implemented self-play by using past versions of player 1 as opponents. In my case there are 5 opponent policies: the first opponent has the latest player 1 weights, the second has the previous snapshot, and so on (roughly as in the sketch after this list). I’ve tried updating the opponent weights in two ways:
- Every N timesteps.
- Every time player 1 has defeated the opponent a certain number of times.
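The weight update itself is roughly this (a simplified sketch assuming RLlib-style `get_weights`/`set_weights`; the policy ids match the config sketch above):

```python
def update_opponents(algorithm):
    """Shift the snapshot queue: opponent_1 receives the current
    "main" weights, opponent_2 receives opponent_1's old weights, etc."""
    main_weights = algorithm.get_policy("main").get_weights()
    # Go from the oldest opponent to the newest so that no snapshot is
    # overwritten before it has been passed down the queue.
    for i in range(5, 1, -1):
        older = algorithm.get_policy(f"opponent_{i - 1}").get_weights()
        algorithm.get_policy(f"opponent_{i}").set_weights(older)
    algorithm.get_policy("opponent_1").set_weights(main_weights)
    # (With remote rollout workers the new weights also have to be
    # synced out to the workers.)
```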
The problem is that player 1 does not seem to learn. I evaluate the model against a depth-1 minimax agent, but after 10M steps it still cannot win even 50 games out of 100.
When I checked the TensorBoard graphs, I noticed that after a short initial period player 1 beats the opponents almost every time (even right after I update the opponents’ weights). Did I miss any important point in this implementation that could be causing this problem?