The documentation implies that a single policy is mapped to a particular agent for the course of training. However, in the multi-agent examples, the policy mapping function uses a random choice to assign policies to agents. Why is this? Have I misunderstood the docs? Thank you!
You have the concept correct. Each agent is mapped to a policy. Any time RLlib sees a new agent returned from reset or step, it will use the policy mapping function to determine which policy to map it to. In a single-agent setup, RLlib automatically creates a policy called “default_policy” and maps all agents to it.
The examples are just showing how to create multiple policies in the multiagent dictionary along with a mapping function. Whoever wrote it decided to assign agents randomly. You can do something else in your function that is more appropriate for your environment.
In the example above they create three policies: one for traffic lights and two for cars. All the traffic lights in the environment always use the same policy, but the cars are randomly assigned to one of the two policies. Even though the initial assignment is random, once an agent is assigned to a policy it will always use that same policy during that training session.
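If it helps, here is a rough sketch of the two styles of mapping function side by side. This is plain Python and not copied from the RLlib source. The policy names ("car1", "car2", "traffic_light") and the agent ID format ("car_<number>", "traffic_light_<number>") are assumptions for illustration, and the loose `*args, **kwargs` signature is only there because the extra arguments RLlib passes to the function differ between versions.

```python
import random

# Hypothetical policy names; use whatever keys you defined in your policies dict.
CAR_POLICIES = ["car1", "car2"]
TRAFFIC_LIGHT_POLICY = "traffic_light"


def random_mapping_fn(agent_id, *args, **kwargs):
    """Random style: each call for a car picks one of the car policies."""
    if agent_id.startswith("traffic_light_"):
        return TRAFFIC_LIGHT_POLICY
    return random.choice(CAR_POLICIES)


def deterministic_mapping_fn(agent_id, *args, **kwargs):
    """Deterministic style: the same agent ID always yields the same policy.

    Assumes agent IDs end in "_<integer>", e.g. "car_0" (an assumption
    for this sketch, not something RLlib requires).
    """
    if agent_id.startswith("traffic_light_"):
        return TRAFFIC_LIGHT_POLICY
    index = int(agent_id.rsplit("_", 1)[1])
    # Even-numbered cars go to the first policy, odd-numbered to the second.
    return CAR_POLICIES[index % len(CAR_POLICIES)]
```

Either function can be passed as the `policy_mapping_fn` entry of the multiagent dictionary. The deterministic version gives the same answer no matter how often or when it is called, so a given car's policy never depends on when the lookup happens.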
That cleared it up, thanks!