Currently I want to implement a central critic. Unfortunately, not all of my agents act at every timestep.
As far as I know, there are two options for a central critic:
use a mixin as in centralized_critic.py
→ This would mean that all actors would have to act at the same time. Is there a way around that?
→ I could have a separate agent that observes the whole env. I found that I might use the episode argument and save something in the user_data field of the Episode. How could I ensure that one agent gets processed first (the global agent would need to write the field)? Preference-wise, I think something in this direction would be the cleanest solution.
add global state to EVERY observation (centralized_critic_2)
→ This certainly would work, but I would like to avoid it, as it’s a huge overhead: in every agent’s action I compute the central value, …
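To make the user_data idea from option 1 concrete, this is roughly the shape I have in mind. The EpisodeStub class and the callback name here are just placeholders for illustration, not actual RLlib API; the real episode object does expose a user_data dict, but everything else is assumed:

```python
class EpisodeStub:
    """Minimal stand-in for RLlib's episode object (which also carries user_data)."""
    def __init__(self):
        self.user_data = {}

def on_episode_step(episode, step_infos):
    """Hypothetical per-step callback.

    The global agent writes the shared state into user_data first; every
    other agent then reads it from there instead of carrying it around
    in its own observation.
    """
    if "global_agent" in step_infos:
        episode.user_data["global_state"] = step_infos["global_agent"]
    return episode.user_data.get("global_state")

# The global agent writes on this step ...
episode = EpisodeStub()
on_episode_step(episode, {"global_agent": [0.0, 1.0, 2.0]})
# ... and a later step with no global agent still sees the cached value.
cached = on_episode_step(episode, {})
```

The open question is exactly the ordering: whether the write is guaranteed to happen before the other agents are processed in the same step.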
Do you have ideas on how to solve this problem, or even just quick suggestions? Thank you.
Btw, I’m using torch, but I do not think solving this problem within the scope of the Model API helps.
What about using option 1, but putting the global state in the info dictionary and then looking for it there in the callback? You could put it in one agent’s info or in all of them. You would have full control over that part.
If this global state is large, it does have the issue of adding a lot of extra data to the sample batch if you add it to every agent’s info dictionary. On the other hand, if you put it in only one agent’s, you would need some more complicated logic to find it, since every agent has the possibility of being done before the episode finishes.
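A rough sketch of the single-carrier variant, with plain dicts standing in for the env’s per-agent info return (the agent ids and the "global_state" key are assumptions, and none of this is literal RLlib API):

```python
def find_global_state(infos):
    """Scan the per-agent info dicts for whichever agent carries the state.

    This is the "more complicated logic" mentioned above: you cannot rely
    on a fixed agent id if the carrying agent might be done early.
    """
    for agent_id, info in infos.items():
        if "global_state" in info:
            return info["global_state"]
    # If the carrying agent can finish before the episode does, you would
    # need a fallback here, e.g. caching the last state seen in user_data.
    raise KeyError("no agent carried the global state this step")

# Only agent_0 carries the state, so the sample batch stays small.
infos = {"agent_0": {"global_state": [1.0, 2.0]}, "agent_1": {}}
state = find_global_state(infos)
```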
Another option that is commonly used is to not mark individual agents as done in the middle of the episode. Instead, they would get a “null” observation (usually all zeros), and there would be a noop action. Combine these two changes with an action mask that masks out all actions but noop for agents that are unofficially “done”. This approach really only works well in practice with discrete action spaces.
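A minimal sketch of that combination; the choice of action 0 as the noop and the observation shape are assumptions:

```python
import numpy as np

NOOP = 0  # assume the first discrete action is the no-op

def action_mask(agent_done, n_actions):
    """All actions allowed while the agent is alive; only noop once it is 'done'."""
    mask = np.ones(n_actions, dtype=np.float32)
    if agent_done:
        mask[:] = 0.0
        mask[NOOP] = 1.0
    return mask

def observation(agent_done, real_obs):
    """Return the real observation while alive, a zero ('null') observation after."""
    return np.zeros_like(real_obs) if agent_done else real_obs
```

The env keeps stepping every agent until the episode truly ends; “done” agents just emit zeros and are forced to pick noop, so the batch shapes stay fixed.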