The comment here mentions that on-policy and off-policy actions can be used together:
What is the best practice when training in that fashion? Do the
log_action calls occur during the same step/episode?
This is what I have done so far, and I wanted to verify whether it is the correct way:
1. Start an episode.
2. Get an action from the policy client.
3. Log the off-policy action using the log_action function.
4. Log rewards for the action taken by the policy client.
5. End the episode.
As you can see, all of these steps happen within the same step/episode.
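
To make the flow concrete, here is a rough sketch of the call order I mean. It is only an illustration, not something I know to be correct. StubPolicyClient is a hypothetical stand-in I wrote so the snippet runs without a policy server; its method names mirror the ones I understand RLlib's PolicyClient to expose (start_episode, get_action, log_action, log_returns, end_episode), and the observations, actions, and reward are made-up values.

```python
from typing import Any, List, Tuple


class StubPolicyClient:
    """Hypothetical stand-in that only records calls; this is NOT RLlib code."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Any]] = []

    def start_episode(self, training_enabled: bool = True) -> str:
        self.calls.append(("start_episode", training_enabled))
        return "episode-0"  # made-up episode id

    def get_action(self, episode_id: str, observation: Any) -> int:
        self.calls.append(("get_action", observation))
        return 0  # placeholder on-policy action

    def log_action(self, episode_id: str, observation: Any, action: Any) -> None:
        self.calls.append(("log_action", action))

    def log_returns(self, episode_id: str, reward: float) -> None:
        self.calls.append(("log_returns", reward))

    def end_episode(self, episode_id: str, observation: Any) -> None:
        self.calls.append(("end_episode", observation))


def run_one_step_episode(client, obs, off_policy_action, reward, final_obs):
    """Run the five steps from my list inside a single episode."""
    episode_id = client.start_episode(training_enabled=True)   # step 1
    on_policy_action = client.get_action(episode_id, obs)      # step 2
    client.log_action(episode_id, obs, off_policy_action)      # step 3
    client.log_returns(episode_id, reward)                     # step 4
    client.end_episode(episode_id, final_obs)                  # step 5
    return on_policy_action


client = StubPolicyClient()
run_one_step_episode(
    client,
    obs=[0.0, 1.0],
    off_policy_action=1,
    reward=1.0,
    final_obs=[0.0, 0.0],
)
call_order = [name for name, _ in client.calls]
```

Here both get_action and log_action are called for the same observation in the same episode, which is exactly the part I am unsure about.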
Any examples that do this would also be helpful.