I'm attempting to build a multi-agent RL trainer in which a learnable policy plays against a range of saved historical variants of itself, sampled from a fixed-size menagerie.
At the moment this is implemented through a custom callback, whose on_train_result method:
1. Appends the latest policy index to a fixed-size list of stored policy indices (popping the first element if the list exceeds the memory size).
2. Defines a new policy_mapping_fn which selects a policy from the list of stored policy indices.
3. If an element was popped from the list, calls algorithm.remove_policy(policy_id=old_policy_id) to drop the corresponding policy from the algorithm.
4. Creates a new policy, gets the state from the currently learning policy, and assigns that state to the new policy.
5. Calls algorithm.sync_weights().
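To make the steps above concrete, here is a minimal sketch of the callback logic. This is not my actual code: the policy ids ("main", "opponent_N"), the agent id "learner", and the menagerie size are placeholder assumptions, and in a real run the class would subclass RLlib's DefaultCallbacks rather than being a plain class.

```python
import random


class MenagerieCallback:
    """Sketch of the five steps above. In RLlib this would subclass
    DefaultCallbacks; add_policy / remove_policy / sync_weights follow
    the Algorithm API, but all ids here are illustrative assumptions."""

    def __init__(self, menagerie_size=5):
        self.menagerie_size = menagerie_size
        self.menagerie = []        # policy ids of stored snapshots
        self.snapshot_count = 0

    def on_train_result(self, *, algorithm, result, **kwargs):
        # Step 1: append the new snapshot id, popping the oldest if full.
        new_id = f"opponent_{self.snapshot_count}"
        self.snapshot_count += 1
        self.menagerie.append(new_id)
        old_id = None
        if len(self.menagerie) > self.menagerie_size:
            old_id = self.menagerie.pop(0)

        # Step 2: a mapping fn closing over a copy of the current list.
        # (Signature follows the RLlib 2.x convention, roughly.)
        menagerie_now = list(self.menagerie)

        def policy_mapping_fn(agent_id, episode=None, worker=None, **kw):
            if agent_id == "learner":
                return "main"          # the learning policy
            return random.choice(menagerie_now)

        # Step 3: drop the evicted snapshot from the algorithm.
        if old_id is not None:
            algorithm.remove_policy(policy_id=old_id)

        # Step 4: add the new snapshot and seed it with the learner's
        # current state, installing the new mapping fn at the same time.
        main_policy = algorithm.get_policy("main")
        algorithm.add_policy(
            policy_id=new_id,
            policy_cls=type(main_policy),
            policy_mapping_fn=policy_mapping_fn,
        )
        algorithm.get_policy(new_id).set_state(main_policy.get_state())

        # Step 5: push the updated policy dict to all rollout workers.
        algorithm.sync_weights()
```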
However, step 3 is not working as I would expect. If I remove policies from the policy set, I receive an error stating that the policy_mapping_fn has returned an invalid policy_id. If I drop step 3 and simply let the dictionary of stored policy weights grow without any pruning, everything works fine, but that is obviously unlikely to be optimal.
To me this suggests that I don't understand how the callback works in the context of the distributed workers, and that something has become unsynchronised during the process. I haven't been able to track down any multi-agent example that uses remove_policy, so I'm struggling to work out the correct RLlib way to manage this synchronisation issue, and was wondering if anyone has experience on this front?
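My current guess is that the workers keep using a mapping fn that can still return the evicted id, so I imagine the removal and the mapping-fn swap need to happen atomically. Something like the helper below, which passes the new fn into remove_policy itself (I believe remove_policy accepts a policy_mapping_fn kwarg in recent Ray versions, but that is an assumption, and evict_policy and all ids here are hypothetical):

```python
import random


def evict_policy(algorithm, old_id, menagerie, rng=random):
    """Hypothetical helper: remove an evicted opponent snapshot while
    atomically installing a mapping fn that can no longer return its id.
    The policy_mapping_fn kwarg on remove_policy is assumed from the
    RLlib Algorithm API and may differ across Ray versions."""
    surviving = [pid for pid in menagerie if pid != old_id]

    def policy_mapping_fn(agent_id, episode=None, worker=None, **kw):
        if agent_id == "learner":
            return "main"              # the learning policy (assumed id)
        return rng.choice(surviving)   # only ids that still exist

    # Swap the mapping fn in the same call that removes the policy, so
    # no worker can sample the dead id in between.
    algorithm.remove_policy(
        policy_id=old_id,
        policy_mapping_fn=policy_mapping_fn,
    )
    return surviving
```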
Thanks in advance
How severely does this issue affect your experience of using Ray?
- Medium: It causes significant difficulty in completing my task, but I can work around it.