What is the difference between `log_action` and `get_action` and when to use them?

What is the difference between the two functions log_action and get_action, and when should you use them?

Also, how do you calculate the reward value to pass to the log_returns function if you are using log_action, since it doesn't return the action that the RL Agent took?

Hi @aviskarkc10,

these two functions are used in external environments, i.e. environments that pull actions from the agent. Because of this, it is important that the environment can provide the agent with information about actions that have been taken that do not necessarily come from the agent. Say you have a control environment where you have to take an action every 100 ms: if action inference takes longer, the environment might take several steps without querying the agent. In that case you would still want to inform the agent about the actions you have taken.

log_action is used to record these off-policy actions. get_action returns (and logs) the current on-policy action to execute in the environment.

The reward comes from the environment, and since you pass the off-policy action to log_action, you already know the action and have usually done an environment step with it before.
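For illustration, here is a minimal sketch of how the two calls differ on the client side, using RLlib's PolicyClient; the server address, the CartPole environment, and the sampled stand-in for the external action are assumptions for the example, not from this thread:

```python
# Minimal sketch contrasting get_action and log_action with RLlib's
# PolicyClient. Assumes a policy server is already running at localhost:9900
# and uses CartPole purely as a stand-in environment.
import gym
from ray.rllib.env.policy_client import PolicyClient

env = gym.make("CartPole-v0")
client = PolicyClient("http://localhost:9900", inference_mode="remote")

obs = env.reset()
episode_id = client.start_episode(training_enabled=True)

# On-policy: ask the agent what to do (this call also logs the action).
action = client.get_action(episode_id, obs)
obs, reward, done, info = env.step(action)
client.log_returns(episode_id, reward)

# Off-policy: the action came from somewhere else (a controller, a human,
# a repeated action while inference was too slow), so just log it.
external_action = env.action_space.sample()  # stand-in for the external action
client.log_action(episode_id, obs, external_action)
obs, reward, done, info = env.step(external_action)
client.log_returns(episode_id, reward)

client.end_episode(episode_id, obs)
```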


Hi @kai,

Thanks for the information.

So this is really a use-case specific question. Consider a RL Agent that learns by imitating the actions of a user/human. So we would:

1. start an episode
2. call `log_action`, log an action based on the action of the user
3. call `log_returns`, log the reward
4. end the episode 

Now I am stuck on the third step: how would I calculate the reward to pass to the log_returns function?

The action of the user has to be passed into the environment at some point. This usually happens through a custom step function that calculates the reward.

If you can share parts of your code (specifically the control code and the environment), we might be able to better help you with this.
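For illustration, the pattern described above could look roughly like the sketch below. The environment, its reward function, and the way the human action is obtained are all hypothetical placeholders, not code from this thread:

```python
# Hypothetical sketch: the human's action goes through a custom step() that
# computes the reward, and that reward is what gets passed to log_returns.
import gym
import numpy as np
from ray.rllib.env.policy_client import PolicyClient


class HumanDrivenEnv(gym.Env):
    """Toy environment whose reward function scores whatever action is applied."""

    def __init__(self, config=None):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,))
        self.action_space = gym.spaces.Discrete(2)
        self._state = np.zeros(4, dtype=np.float32)

    def reset(self):
        self._state = np.zeros(4, dtype=np.float32)
        return self._state

    def step(self, action):
        # Apply the (human-chosen) action and compute the reward here.
        self._state = np.clip(self._state + (action - 0.5), -1.0, 1.0).astype(np.float32)
        reward = float(action == 1)  # placeholder reward function
        done = False
        return self._state, reward, done, {}


env = HumanDrivenEnv()
client = PolicyClient("http://localhost:9900", inference_mode="remote")

obs = env.reset()
episode_id = client.start_episode(training_enabled=True)

human_action = env.action_space.sample()           # stand-in for the user's action
client.log_action(episode_id, obs, human_action)   # log the off-policy action
obs, reward, done, info = env.step(human_action)   # reward comes from step()
client.log_returns(episode_id, reward)

client.end_episode(episode_id, obs)
```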

So our log_action function would call the step function that is in our Agent definition?

Here is what I have right now:

# this is the server file
# Define the RL Agent

import gym


class RLAgent(gym.Env):
    def __init__(self, **kwargs):
        # initialize the agent
        pass

So based on what you are saying, I would also add a step function to the Agent definition that calculates the reward, and our log_action function would call this step function?

Hi @aviskarkc10,

Check out this example. It should have most of the components you need. Let us know if you need more information.

With this approach you will also need to run a server.
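For reference, the server side of that setup looks roughly like the sketch below. The exact config keys and the PolicyServerInput constructor have varied across Ray versions, so treat this as an outline rather than a drop-in script:

```python
# Rough outline of a policy server, modeled on RLlib's cartpole_server
# example. Config keys and constructor signatures differ between Ray
# versions; verify against the version you are running.
import gym
import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.env.policy_server_input import PolicyServerInput

SERVER_ADDRESS = "localhost"
SERVER_PORT = 9900

if __name__ == "__main__":
    ray.init()
    trainer = PPOTrainer(
        env=None,  # no local env: experiences arrive from PolicyClients
        config={
            # Read experiences from the HTTP server instead of sampling locally.
            "input": lambda ioctx: PolicyServerInput(ioctx, SERVER_ADDRESS, SERVER_PORT),
            "num_workers": 0,
            "input_evaluation": [],
            # The trainer still needs to know the external env's spaces.
            "observation_space": gym.spaces.Box(-10.0, 10.0, (4,)),
            "action_space": gym.spaces.Discrete(2),
        },
    )
    while True:
        print(trainer.train())
```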

@aviskarkc10,

An alternative approach that might be closer to what you want to do is in the documentation here:

https://docs.ray.io/en/master/rllib-offline.html?highlight=offline#example-converting-external-experiences-to-batch-format
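That doc example essentially builds offline batches by hand, roughly like this simplified sketch (the real example also records action probabilities, and the module paths have moved between Ray versions):

```python
# Sketch of converting externally collected experiences into RLlib's offline
# JSON batch format, along the lines of the linked doc example.
import gym
from ray.rllib.evaluation.sample_batch_builder import SampleBatchBuilder
from ray.rllib.offline.json_writer import JsonWriter

env = gym.make("CartPole-v0")
batch_builder = SampleBatchBuilder()
writer = JsonWriter("/tmp/demo-out")

for eps_id in range(10):
    obs = env.reset()
    t = 0
    done = False
    while not done:
        action = env.action_space.sample()  # stand-in for the human/demo action
        new_obs, reward, done, info = env.step(action)
        batch_builder.add_values(
            t=t,
            eps_id=eps_id,
            agent_index=0,
            obs=obs,
            actions=action,
            rewards=reward,
            dones=done,
            infos=info,
            new_obs=new_obs,
        )
        obs = new_obs
        t += 1
    # Write one episode per JSON batch.
    writer.write(batch_builder.build_and_reset())
```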

Thanks. I finally got the entire picture


@kai, @mannyv, @sven1977 and so on:
Please correct me if I'm wrong, but log_action also does a call to the NN, I guess to calculate V(s) for the observation sent along with the logged off-policy action.
Also, the logged off-policy action will be stored in a sample batch for later training, right?

Hi @klausk55

log_action is part of the ExternalEnv / PolicyServer/PolicyClient API and it is intended for offline collection. It does not call the NN. get_action does, though.

@sven1977,

The PolicyClient's log_action calls _update_local_policy, but I don't think it actually uses it. I think it can be removed.


Hey @mannyv, thanks for digging into this. I think we should leave this call to _update_local_policy() (only done for inference_mode="local") inside log_action.
It just makes sure that we stick to the update interval with which the local policy is synced from the server. It could be that we have not called log_action or get_action in a long time and want to make sure we have the latest version of the weights. Maybe the policy was trained further on the server from other clients' data, and maybe we would want to do something with those weights even without calling log_action/get_action.
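For reference, the update interval mentioned here is the update_interval argument of PolicyClient when running with local inference. A short sketch, with an assumed server address:

```python
# Sketch: with inference_mode="local" the client keeps its own copy of the
# policy and re-syncs weights from the server at most every `update_interval`
# seconds; the _update_local_policy() call inside log_action/get_action is
# what enforces that interval.
from ray.rllib.env.policy_client import PolicyClient

client = PolicyClient(
    "http://localhost:9900",
    inference_mode="local",   # run the NN locally on the client
    update_interval=10.0,     # seconds between weight syncs from the server
)
```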


Hey @mannyv,

Sorry mate, but I'm almost sure that log_action also does a forward call to the NN.
Inside the function _env_runner in sampler.py we first poll data from the env, and log_action puts its data (obs and off-policy action) into the queue. Some lines of code later, I guess in _do_policy_eval, a forward call to the NN should happen. I tried and tested this with some simple printouts to the console (e.g. a printout in the forward function of the model).

@mannyv and @sven1977, I'm still not quite sure how such an 'offline/off-policy-collected' sample from log_action is processed by RLlib.
Suppose we use the PPO algorithm: is such a sample also stored in a batch and treated like any other default on-policy-collected sample? (Thinking of a case like learning from demonstrations.)

Thanks in advance!

Hi @klausk55,

I was talking about the client side (PolicyClient). I do not think that calling log_action on the client side causes it to interact with the neural network. The PolicyClient does have a RolloutWorker within it, so, for example, if on_sample_end made a call to the NN, then so would the PolicyClient.

Looking at the code it does not look like the receiving side (PolicyServer) uses the neural network when it receives a log_action message.

If the PolicyServer was feeding data to a Trainer, for example PPO as you mention, then yes: during on_sample_end (when it computes the advantage / GAE) it would interact with the NN, and of course during learn_on_batch it would also interact with the neural network and update its weights. If you added a callback that interacted with the NN between environment steps, then yes, that would also occur, but I think only on the server side, not the client side.

WRT your PPO question, from the perspective of the Trainer this is just another input source, so it would treat it the same way as it would treat data coming from a local environment.

Hey @mannyv,

I guess we are both right, since IMO it depends on the framework setting.
With framework "tf" there is no forward call to the model, whereas with framework "tf2" (eager) there is a forward call to the model in the RolloutWorker on the client side (inference_mode="local").
You might check out this slightly modified cartpole_server/client example and play around with the framework settings.