Best practice for training on-policy and off-policy actions together?

The comment here mentions that we can use on-policy and off-policy actions together.

What is the best practice when we train in that fashion? Do the get_action, log_action and log_returns calls occur during the same step/episode?

This is what I have done so far and wanted to verify if this was the correct way:

1. Start an episode

2. Get an action from the policy client.

3. Log the off-policy action using the log_action function

4. Log the reward for the action taken by the policy client using the log_returns function

5. End the episode

As you can see, all the steps are happening within the same step/episode.
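
In code, the loop looks roughly like this (a minimal sketch of the steps above; the server address, env and off_policy_action are placeholders for my actual environment and the externally chosen action):

```python
from ray.rllib.env.policy_client import PolicyClient

# Placeholder address for the policy server.
client = PolicyClient("http://localhost:9900", inference_mode="remote")

obs = env.reset()
episode_id = client.start_episode(training_enabled=True)  # 1. start an episode

action = client.get_action(episode_id, obs)               # 2. get an action from the policy client

client.log_action(episode_id, obs, off_policy_action)     # 3. log the off-policy action

obs, reward, done, info = env.step(action)
client.log_returns(episode_id, reward)                    # 4. log the reward for the action taken

client.end_episode(episode_id, obs)                       # 5. end the episode
```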

Any examples that have done this would also be helpful.

Hey @aviskarkc10, thanks for the question.

I think the confusion here is between get_action and log_action.
Whenever you use get_action, an action is computed by your current policy. This is true for both inference_mode="remote" (the action gets computed on the server) and "local" (the action gets computed on the client side). In either case, the action will automatically be used by the server for learning (an on-policy action).
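
For example (a sketch; the server address is a placeholder), the only difference on the client side is the inference_mode argument:

```python
from ray.rllib.env.policy_client import PolicyClient

# "remote": get_action sends the observation to the server, which computes the action.
client = PolicyClient("http://localhost:9900", inference_mode="remote")

# "local": get_action uses a policy copy embedded in the client (synced from the server periodically).
# client = PolicyClient("http://localhost:9900", inference_mode="local")
```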

On top of that, you may also compute actions via some other mechanism (e.g. pick random ones via a heuristic action-computing system running on your client side), thus not using get_action, and then let your policy server know that you took these actions via log_action (for off-policy learning).
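
Something like this (a sketch; my_heuristic stands for whatever client-side mechanism picks the action, and client / episode_id / env are the same objects as in your episode loop):

```python
# Action chosen outside of the policy, e.g. by a heuristic running on the client
# (my_heuristic is a hypothetical helper, not part of RLlib).
action = my_heuristic(obs)

# Tell the server which action was actually taken (off-policy sample) ...
client.log_action(episode_id, obs, action)

# ... and what reward it produced.
obs, reward, done, info = env.step(action)
client.log_returns(episode_id, reward)
```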

Of course, you could also do a = get_action() plus log_action(a), but I don't think this would make a lot of sense: a would automatically be used for learning anyway because of the get_action call.

Thanks for the great explanation.

My use case is a bit different. We have real users taking actions as well as the RL model. So instead of it being:

a = get_action()

log_action(a)

It is as follows:

on_policy_action = get_action()

log_action(some_other_action_taken_by_user)

log_returns(reward)

Hey @aviskarkc10, I see. I think this is still possible, though.

If you use inference_mode="local", you could access your local policy directly to compute the policy action (instead of calling get_action). This way, get_action (never called) would not interfere with your log_action calls, and you would be in control of which actions are used for learning.

To get a local policy action, you could do:

client.rollout_worker.policy_map[<your policy ID, or "default_policy">].compute_single_action([some obs])
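
Putting it together, something like this (a sketch; the exact attribute names, e.g. rollout_worker, can differ between Ray versions, and user_action / user_acted / obs / reward are placeholders for your client-side logic):

```python
# Requires inference_mode="local", so the client holds a copy of the policy.
policy = client.rollout_worker.policy_map["default_policy"]  # or your policy ID

# compute_single_action returns a tuple: (action, rnn_state, extra_fetches).
policy_action, _, _ = policy.compute_single_action(obs)

# Pick whichever action was actually executed (the real user's or the policy's) ...
action_taken = user_action if user_acted else policy_action

# ... and log only that one, so it becomes the sample used for learning.
client.log_action(episode_id, obs, action_taken)
client.log_returns(episode_id, reward)
```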

Does this make sense?

I think it makes sense, yes. I first need to research compute_single_action a bit, but your explanation does light something up in my head. Thanks!