How do I compute an action from a trained RLlib PPO policy with the new API?

I’m trying to get a computed action from a trained PPO model, but all of the examples I’ve found use the old API, not the new one.

What is the new API equivalent to this?

action = my_trained_policy.compute_single_action(obs)

In the new RLlib API stack, you should use the RLModule’s forward_inference method instead of compute_single_action. Specifically, after loading your RLModule from a checkpoint, call rl_module.forward_inference({"obs": torch_obs_batch}) to get the action-distribution inputs, then sample or select an action from the resulting distribution. The old compute_single_action API is deprecated and not available in the new stack. See the official example here and discussion in this issue.
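
For reference, here is a minimal sketch of that flow for a continuous action space. It assumes a recent Ray release with the new API stack, a single-agent checkpoint using the default module ID "default_policy", and Pendulum-v1 standing in for your environment; the checkpoint path and its sub-directory layout are placeholders that may differ between Ray versions and your setup.

```python
from pathlib import Path

import gymnasium as gym
import numpy as np
import torch
from ray.rllib.core.rl_module.rl_module import RLModule

# Hypothetical checkpoint directory -- replace with your own training result dir.
ckpt = Path("~/ray_results/PPO_2025-01-01_00-00-00/checkpoint_000005").expanduser()

# In recent Ray versions the trained module is stored under this sub-path inside
# the checkpoint; the exact layout may differ slightly between releases.
rl_module = RLModule.from_checkpoint(
    ckpt / "learner_group" / "learner" / "rl_module" / "default_policy"
)

# Example continuous-action env standing in for your own environment.
env = gym.make("Pendulum-v1")
obs, _ = env.reset()

# Batch the single observation to shape (1, obs_dim).
torch_obs_batch = torch.from_numpy(np.asarray([obs], dtype=np.float32))

# forward_inference returns the action-distribution inputs
# (for a continuous space: mean and log-std of a diagonal Gaussian).
with torch.no_grad():
    fwd_out = rl_module.forward_inference({"obs": torch_obs_batch})

# Build the action distribution from those inputs and draw an action.
dist_cls = rl_module.get_inference_action_dist_cls()
action_dist = dist_cls.from_logits(fwd_out["action_dist_inputs"])
# .sample() draws a stochastic action; action_dist.to_deterministic() instead
# gives the distribution's mean (greedy) action.
action = action_dist.sample()[0].numpy()
print(action)
```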

Would you like a step-by-step code example for this process?


Hint: Mention @RunLLM in the post for followups.

A step-by-step code example would be great. I have a continuous action space, so I think I need to sample from the distribution. Will the sampled values correspond to “recommended” values that optimize the reward?