How to deploy a trained Ray RLlib PPO policy/model in a multi-agent case?

Hello,

How can I deploy a trained Ray RLlib PPO policy/model in a multi-agent case with an RNN-based policy?

I guess the first step is to load/restore the PPO Trainer (i.e. trainer.restore(checkpoint)).
Then there are the functions trainer.compute_single_action and trainer.compute_actions. The latter seems to compute actions for a batch of observations under one specific policy.

What I want is to compute a single action for one of the agents using its RNN-based policy.
Do I have to use trainer.compute_single_action and pass the observation, the RNN state, and the policy ID to it?

@gjoliver any ideas here?

Hi @klausk55,

Have a look at this documentation: RLlib Training APIs — Ray v1.8.0

In the multi-agent case, the obs should be a dictionary keyed by the agent(s) you want to compute actions for.

Also, don’t forget that you need to chain the output state of one call as the input state of the next call to compute actions for that same agent.
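As a rough sketch of that chaining for one agent (the policy ID, obs, and next_obs below are placeholders):

    # The state returned by one call becomes the state input of the next call.
    # "policy_0" is a placeholder; obs / next_obs come from your multi-agent env.
    pid = "policy_0"
    state = trainer.get_policy(pid).get_initial_state()
    action, state, _ = trainer.compute_single_action(obs, state=state, policy_id=pid)
    action, state, _ = trainer.compute_single_action(next_obs, state=state, policy_id=pid)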

Here is an example using Ray Serve:
RLlib Tutorial — Ray v1.8.0

Thanks @mannyv!

I guess in the multi-agent case, where the obs is a MultiAgentDict, the method to invoke should be compute_actions (rather than compute_single_action), since it accepts a dict as the obs.

Here you mean the internal state in the case of an RNN-based policy, right? If so, what would you say is an appropriate initial state for the first call to compute an action? Simply zero arrays?

Yes, that’s a great example for an online serving use case! You’ve already made me aware of this in a previous post. I appreciate your help, thanks!

@klausk55,

compute_actions should also accept a dictionary observation. You can use either.

The policy has a get_initial_state method you can use for that.
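For example (the policy name here is just a placeholder for whatever your policy mapping uses):

    # Initial RNN state for an agent, taken from its policy.
    init_state = trainer.get_policy("policy_0").get_initial_state()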

What do you think @mannyv? Could it look like this?

    state = {}  # per-agent RNN state, keyed by agent ID
    done = {"__all__": False}

    obs_dict = env.reset()

    while not done["__all__"]:
        action_dict = {}
        for agent_id, obs in obs_dict.items():
            # Initialize the RNN state from the agent's policy on its first step.
            if state.get(agent_id) is None:
                state[agent_id] = trainer.get_policy(
                    policy_id="policy_{}".format(agent_id)).get_initial_state()
            # With a state passed in, compute_single_action returns (action, state_out, extra).
            action_dict[agent_id], state[agent_id], _ = trainer.compute_single_action(
                observation=obs, state=state[agent_id],
                policy_id="policy_{}".format(agent_id))
        obs_dict, reward, done, info = env.step(action_dict)