Best practice for training on-policy and off-policy actions together?

The comment here mentions that we can use on-policy and off-policy actions together.

What is the best practice when we train in that fashion? Do the get_action, log_action and log_returns calls occur during the same step/episode?

This is what I have done so far and wanted to verify if this was the correct way:

1. Start an episode

2. Get an action from the policy client.

3. Log the off-policy action using the log_action function

4. Log the reward for the action taken by the policy client using the log_returns function

5. End the episode

As you can see, all the steps are happening within the same step/episode.
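
In code, the loop looks roughly like this (a minimal sketch of the steps above; the server address, env and off_policy_action are placeholders for my actual environment and the externally chosen action):

```python
from ray.rllib.env.policy_client import PolicyClient

# Placeholder address for the policy server.
client = PolicyClient("http://localhost:9900", inference_mode="remote")

obs = env.reset()
episode_id = client.start_episode(training_enabled=True)  # 1. start an episode

action = client.get_action(episode_id, obs)               # 2. get an action from the policy client

client.log_action(episode_id, obs, off_policy_action)     # 3. log the off-policy action

obs, reward, done, info = env.step(action)
client.log_returns(episode_id, reward)                    # 4. log the reward for the action taken

client.end_episode(episode_id, obs)                       # 5. end the episode
```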

Any examples that have done this would also be helpful.

Hey @aviskarkc10, thanks for the question.

I think the confusion here is between get_action and log_action.
Whenever you use get_action, an action is computed by your current policy. This is true for both inference_mode="remote" (the action gets computed on the server) and "local" (the action gets computed on the client side). In either case, the action will automatically be used by the server for learning (an on-policy action).
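
For example (a sketch; the server address is a placeholder), the only difference on the client side is the inference_mode argument:

```python
from ray.rllib.env.policy_client import PolicyClient

# "remote": get_action sends the observation to the server, which computes the action.
client = PolicyClient("http://localhost:9900", inference_mode="remote")

# "local": get_action uses a policy copy embedded in the client (synced from the server periodically).
# client = PolicyClient("http://localhost:9900", inference_mode="local")
```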

On top of that, you may also compute actions via some other mechanism (e.g. pick random ones via a heuristic action-computing system running on your client side), thus not using get_action, and then let your policy server know that you took these actions via log_action (for off-policy learning).
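
Something like this (a sketch; my_heuristic stands for whatever client-side mechanism picks the action, and client / episode_id / env are the same objects as in your episode loop):

```python
# Action chosen outside of the policy, e.g. by a heuristic running on the client
# (my_heuristic is a hypothetical helper, not part of RLlib).
action = my_heuristic(obs)

# Tell the server which action was actually taken (off-policy sample) ...
client.log_action(episode_id, obs, action)

# ... and what reward it produced.
obs, reward, done, info = env.step(action)
client.log_returns(episode_id, reward)
```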

Of course, you could also do a = get_action() plus log_action(a), but I don't think this would make a lot of sense: a would automatically be used for learning anyway because of the get_action call.

Thanks for the great explanation.

My use case is a bit different. We have real users taking actions as well as the RL model. So instead of it being:

a = get_action()

log_action(a)

It is as follows:

on_policy_action = get_action()

log_action(some_other_action_taken_by_user)

log_returns(reward)

Hey @aviskarkc10, I see. I think this is still possible, though.

If you use inference_mode="local", you could access your local policy directly to compute the policy action (instead of calling get_action). This way, get_action (never called) would not interfere with your log_action calls, and you would be in control of which actions are used for learning.

To get a local policy action, you could do:

client.rollout_worker.policy_map[<your policy ID, or "default_policy">].compute_single_action([some obs])
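
Putting it together, something like this (a sketch; the exact attribute names, e.g. rollout_worker, can differ between Ray versions, and user_action / user_acted / obs / reward are placeholders for your client-side logic):

```python
# Requires inference_mode="local", so the client holds a copy of the policy.
policy = client.rollout_worker.policy_map["default_policy"]  # or your policy ID

# compute_single_action returns a tuple: (action, rnn_state, extra_fetches).
policy_action, _, _ = policy.compute_single_action(obs)

# Pick whichever action was actually executed (the real user's or the policy's) ...
action_taken = user_action if user_acted else policy_action

# ... and log only that one, so it becomes the sample used for learning.
client.log_action(episode_id, obs, action_taken)
client.log_returns(episode_id, reward)
```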

Does this make sense?

I think it makes sense, yes. I first need to research compute_single_action a bit, but your explanation does light something up in my head. Thanks!