How can I deploy a trained Ray RLlib PPO policy/model in a multi-agent case with an RNN-based policy?
I guess the first step is to load/restore the PPO Trainer (i.e. trainer.restore(checkpoint)).
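Roughly what I have in mind, untested; the env name, policy IDs, and checkpoint path below are just placeholders that would have to match the original training setup:

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

# The config has to match the one used for training, in particular the
# multi-agent policy definitions and the RNN model settings.
config = {
    "framework": "torch",
    "num_workers": 0,
    "model": {"use_lstm": True},  # RNN-based policies
    "multiagent": {
        "policies": {"policy_0", "policy_1"},  # placeholder policy IDs
        "policy_mapping_fn": lambda agent_id, *args, **kwargs: (
            "policy_0" if agent_id == "agent_0" else "policy_1"
        ),
    },
}

# Env registered under the same name as during training (placeholder).
trainer = PPOTrainer(config=config, env="my_multi_agent_env")
trainer.restore("/path/to/checkpoints/checkpoint_000100/checkpoint-100")
```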
Then there are the functions trainer.compute_single_action and trainer.compute_actions. The latter seems to compute actions for a batch of observations under one specific policy.
What I want is to compute a single action for one of the agents using its RNN-based policy.
Do I have to use trainer.compute_single_action and pass the observation, the RNN state, and the policy ID to it?
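To make it concrete, something like this is what I imagine (untested sketch; `env`, the agent ID, and the policy ID are placeholders, and the state handling is exactly the part I'm unsure about):

```python
policy_id = "policy_0"

# Keep one RNN state per agent; start from the policy's initial state.
rnn_state = trainer.get_policy(policy_id).get_initial_state()

obs = env.reset()  # MultiAgentDict: {agent_id: observation}

# With an RNN policy, compute_single_action also returns the next state,
# which has to be fed back in on the following call.
action, rnn_state, _ = trainer.compute_single_action(
    obs["agent_0"],
    state=rnn_state,
    policy_id=policy_id,
    explore=False,
)
```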
I guess in the multi-agent case, where the obs is a MultiAgentDict, the method to invoke should be compute_actions (note the plural), since it accepts a dict as the obs.
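Untested, and probably version-dependent, but something along these lines is what I mean (policy and agent IDs are placeholders; all observations in the dict get routed through the one policy given by policy_id):

```python
obs = env.reset()  # e.g. {"agent_0": obs_0, "agent_1": obs_1}

# One RNN state per agent, keyed like the observation dict.
init_state = trainer.get_policy("policy_0").get_initial_state()
states = {agent_id: init_state for agent_id in obs}

# When a state is passed in, the call returns (actions, new states, extra fetches),
# with actions and states again keyed by agent ID.
actions, states, _ = trainer.compute_actions(
    obs,
    state=states,
    policy_id="policy_0",
)
```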
Here you mean the internal state in the case of an RNN-based policy, right? If so, what would you say is an appropriate initial state for the first call to compute an action? Simply zero arrays?
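Or is the right thing to ask the policy itself for it, something like this (the policy ID is a placeholder)?

```python
# For RLlib's built-in LSTM wrapper this comes back as a list of
# zero-filled arrays, one per internal state tensor.
init_state = trainer.get_policy("policy_0").get_initial_state()
```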
Yes, that’s a great example of an online serving use case! You’ve already made me aware of this in a previous post. I appreciate your help, thanks!