We want to save the <s, a, r> trajectories and ingest those samples into the trainer at a later time for the on-policy PPO algorithm. In the policy_client.py and policy_server_input.py examples, the client sends the train-batch samples to the trainer. Is there any way we can save those samples and send them all at once? What code changes, if any, would I have to make? Thanks for any pointers.
Hey @Sudhir , I’m not sure I understand exactly what you mean by “all at once”. Would simply increasing the train_batch_size parameter help here?
Also, if you are using “inference_mode=local” (the default), the policy_client will use its own copy of the policy to compute actions, so not every single observation has to go to the server for action computation. In this case, the batch generated on the client side (using the client’s own RolloutWorker) is sent to the policy server for training all “at once”.
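For reference, a minimal client-side sketch of that setup (not from the thread; it assumes the server from the policy_server_input.py example is listening on localhost:9900, and CartPole only stands in for whatever env the server is configured with):

import gym
from ray.rllib.env.policy_client import PolicyClient

# "local" inference: the client keeps its own copy of the policy and only
# syncs weights with the server every `update_interval` seconds.
client = PolicyClient(
    "http://localhost:9900", inference_mode="local", update_interval=10.0
)

# The env must match whatever the server was configured for; CartPole is
# just the env used in the RLlib client/server examples.
env = gym.make("CartPole-v0")

obs = env.reset()
eid = client.start_episode(training_enabled=True)
done = False
while not done:
    action = client.get_action(eid, obs)
    obs, reward, done, info = env.step(action)
    client.log_returns(eid, reward)
client.end_episode(eid, obs)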
Hi @sven1977 - thanks for your reply. I understand that in local inference mode there is no interaction with the trainer at every step. However, in our case the duration between steps is very large, and it does not make sense to keep Ray running between steps even if the CPU utilization is low during that time. We are investigating whether we can save the step-wise data (s, a, r) to disk and then send it to the trainer when training time arrives, by literally re-starting and re-synchronizing the policy client and server at the start of each step. How can I save the sequence of <state, action, reward> tuples to offline storage, so that once ‘train_batch_size’ steps have been collected, we can send all the sequences to the trainer by sort of ‘replaying’ the steps collected thus far?
Ah, got it. How about using a custom callback class:
config:
"callbacks": [your own sub-class of DefaultCallbacks]
Then, in that sub-class of yours, you override on_postprocess_trajectory, and in there you can do whatever you want with the collected batch, e.g. store it to disk.
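Roughly, such a callback could look like the sketch below (the class name, output directory, and pickle format are placeholders of mine; the import path assumes RLlib 1.x, where DefaultCallbacks lives in ray.rllib.agents.callbacks):

import os
import pickle

from ray.rllib.agents.callbacks import DefaultCallbacks


class SaveTrajectoryCallbacks(DefaultCallbacks):
    """Writes every post-processed (per-agent) trajectory batch to disk."""

    def on_postprocess_trajectory(
        self, *, worker, episode, agent_id, policy_id, policies,
        postprocessed_batch, original_batches, **kwargs
    ):
        out_dir = "/tmp/saved_trajectories"  # placeholder path
        os.makedirs(out_dir, exist_ok=True)
        fname = os.path.join(
            out_dir, "ep{}_{}.pkl".format(episode.episode_id, agent_id)
        )
        with open(fname, "wb") as f:
            pickle.dump(postprocessed_batch, f)


config = {
    # ... your other PPO / client-server settings ...
    "callbacks": SaveTrajectoryCallbacks,
}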
Actually, this is pure offline RL, isn’t it?
a) You collect samples (no training) and save them in logs.
b) At a later time, you take those (offline) samples and feed them to an offline RL algo, such as BC, MARWIL, (offline) DQN, etc…
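As an illustration only, steps (a) and (b) could look roughly like this with the RLlib 1.x trainer APIs (env, paths, and iteration counts are placeholders):

import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.agents.marwil import MARWILTrainer

ray.init()

# (a) Sample (and here also train) while writing every collected sample
#     as JSON files into the "output" directory.
collector = PPOTrainer(env="CartPole-v0", config={"output": "/tmp/cartpole-out"})
for _ in range(10):
    collector.train()

# (b) Later: point an offline algo such as MARWIL at those logged samples.
offline_trainer = MARWILTrainer(
    env="CartPole-v0",
    config={
        "input": "/tmp/cartpole-out",
        "input_evaluation": [],  # skip off-policy estimation on the offline data
    },
)
for _ in range(10):
    offline_trainer.train()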
@sven1977 - you are right, it lends itself better to offline/off-policy RL, and that is something we can look at. In the on-policy client/server scenario (in local inference mode at least), it looks like the trajectory is kept in memory and then sent to the trainer once the training batch size is reached. Is there a way to save this in-memory buffer to disk, keep accumulating it at every step, and then reload it from disk to be sent to the trainer?
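One possible direction, not confirmed in the thread, would be RLlib’s JsonWriter/JsonReader offline utilities to persist SampleBatch objects across restarts and read them back at training time. A rough sketch, with placeholder paths and assuming RLlib 1.x module locations (note that feeding such batches back into PPO manually bypasses its normal on-policy sampling flow):

from ray.rllib.offline.json_writer import JsonWriter
from ray.rllib.offline.json_reader import JsonReader

# Persist each SampleBatch as it becomes available
# (e.g. from a callback like the one sketched above).
writer = JsonWriter("/tmp/client-batches")  # placeholder path
writer.write(sample_batch)                  # sample_batch: an RLlib SampleBatch you collected

# Later, when it is time to train, read the batches back in ...
reader = JsonReader("/tmp/client-batches")
batch = reader.next()  # returns one SampleBatch per call

# ... and hand them to the trainer's policy manually, e.g.:
# trainer.workers.local_worker().policy_map["default_policy"].learn_on_batch(batch)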