We want to save the <s, a, r> trajectories and ingest those samples into the trainer at a later time for the on-policy PPO algorithm. In the policy_client.py and policy_server_input.py examples, the client sends the train-batch samples to the trainer. Is there any way we can save those samples and send them all at once? What code changes, if any, would I have to make? Thanks for any pointers.
Hey @Sudhir , I’m not sure I understand exactly what you mean by “all at once”. Would simply increasing the train_batch_size parameter help here?
Also, if you are using “inference_mode=local” (the default), the policy_client will use its own copy of the policy to compute actions, so not every single observation has to go to the server for action computation. In this case, the batch generated on the client side (using the client’s own RolloutWorker) is sent to the policy server for training all “at once”.
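For reference, a minimal client-side sketch of that setup (not from the thread; it assumes the server from the policy_server_input.py example is listening on localhost:9900, and CartPole only stands in for whatever env the server is configured with):

import gym
from ray.rllib.env.policy_client import PolicyClient

# "local" inference: the client keeps its own copy of the policy and only
# syncs weights with the server every `update_interval` seconds.
client = PolicyClient(
    "http://localhost:9900", inference_mode="local", update_interval=10.0
)

# The env must match whatever the server was configured for; CartPole is
# just the env used in the RLlib client/server examples.
env = gym.make("CartPole-v0")

obs = env.reset()
eid = client.start_episode(training_enabled=True)
done = False
while not done:
    action = client.get_action(eid, obs)
    obs, reward, done, info = env.step(action)
    client.log_returns(eid, reward)
client.end_episode(eid, obs)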
Hi @sven1977 - thanks for your reply. I understand that in local inference mode there is no interaction with the trainer at every step. However, in our case the duration between steps is very large, and it does not make sense to keep Ray running between steps even if the CPU utilization is low during that time. We are investigating whether we can save the step-wise data (s, a, r) to disk and then send it to the trainer when training time arrives, by literally re-starting and re-synchronizing the policy client and server at the start of each step. How can I save the sequence of <state, action, reward> tuples to offline storage, so that once ‘train_batch_size’ steps have been collected, we can send all the sequences to the trainer by sort of ‘replaying’ the steps collected thus far?
Ah, got it. How about using a custom callback class:
config:
"callbacks": [your own sub-class of DefaultCallbacks]
Then, in that sub-class of yours, you override on_postprocess_trajectory, and in there you can do whatever you want with the collected batch, e.g. store it to disk.
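Roughly, such a callback could look like the sketch below (the class name, output directory, and pickle format are placeholders of mine; the import path assumes RLlib 1.x, where DefaultCallbacks lives in ray.rllib.agents.callbacks):

import os
import pickle

from ray.rllib.agents.callbacks import DefaultCallbacks


class SaveTrajectoryCallbacks(DefaultCallbacks):
    """Writes every post-processed (per-agent) trajectory batch to disk."""

    def on_postprocess_trajectory(
        self, *, worker, episode, agent_id, policy_id, policies,
        postprocessed_batch, original_batches, **kwargs
    ):
        out_dir = "/tmp/saved_trajectories"  # placeholder path
        os.makedirs(out_dir, exist_ok=True)
        fname = os.path.join(
            out_dir, "ep{}_{}.pkl".format(episode.episode_id, agent_id)
        )
        with open(fname, "wb") as f:
            pickle.dump(postprocessed_batch, f)


config = {
    # ... your other PPO / client-server settings ...
    "callbacks": SaveTrajectoryCallbacks,
}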
Actually, this is pure offline RL, isn’t it?
a) You collect samples (no training) and save them in logs.
b) At a later time, you take those (offline) samples and feed them to an offline RL algo, such as BC, MARWIL, (offline) DQN, etc…
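As an illustration only, steps (a) and (b) could look roughly like this with the RLlib 1.x trainer APIs (env, paths, and iteration counts are placeholders):

import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.agents.marwil import MARWILTrainer

ray.init()

# (a) Sample (and here also train) while writing every collected sample
#     as JSON files into the "output" directory.
collector = PPOTrainer(env="CartPole-v0", config={"output": "/tmp/cartpole-out"})
for _ in range(10):
    collector.train()

# (b) Later: point an offline algo such as MARWIL at those logged samples.
offline_trainer = MARWILTrainer(
    env="CartPole-v0",
    config={
        "input": "/tmp/cartpole-out",
        "input_evaluation": [],  # skip off-policy estimation on the offline data
    },
)
for _ in range(10):
    offline_trainer.train()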
@sven1977 - you are right, it lends itself better to offline/off-policy RL, and that is something we can look at. In the on-policy client/server scenario (in local inference mode at least), it looks like the trajectory is kept in memory and then sent to the trainer once the training batch size is reached. Is there a way to save this in-memory buffer to disk, keep accumulating it at every step, and then reload it from disk to be sent to the trainer?
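One possible direction, not confirmed in the thread, would be RLlib’s JsonWriter/JsonReader offline utilities to persist SampleBatch objects across restarts and read them back at training time. A rough sketch, with placeholder paths and assuming RLlib 1.x module locations (note that feeding such batches back into PPO manually bypasses its normal on-policy sampling flow):

from ray.rllib.offline.json_writer import JsonWriter
from ray.rllib.offline.json_reader import JsonReader

# Persist each SampleBatch as it becomes available
# (e.g. from a callback like the one sketched above).
writer = JsonWriter("/tmp/client-batches")  # placeholder path
writer.write(sample_batch)                  # sample_batch: an RLlib SampleBatch you collected

# Later, when it is time to train, read the batches back in ...
reader = JsonReader("/tmp/client-batches")
batch = reader.next()  # returns one SampleBatch per call

# ... and hand them to the trainer's policy manually, e.g.:
# trainer.workers.local_worker().policy_map["default_policy"].learn_on_batch(batch)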