I’ve been looking around and I’m now wondering whether it would make sense to combine offline RL with PPO (or another online RL algorithm)?
I ask because in my application it is possible to have some historical trajectory data for particular examples, as well as an appropriate simulation environment for online RL. I was thinking of a sort of “warm start” of the online algorithm with expert knowledge, let’s say.
If the above is possible, what would be a sort of “best practice” for doing so? Any pointers would be much appreciated.
If not, what would be the way to go? Any suggestion?
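For reference, here is a minimal sketch of the warm-start idea I mean: behavior-cloning (supervised) pretraining of a policy network on historical (observation, action) pairs, whose weights would then initialize the actor of an online PPO run. Everything here is hypothetical (dimensions, the synthetic "expert" data, the file name); it is just to make the question concrete, not a working PPO setup.

```python
import torch
import torch.nn as nn

# Hypothetical setup: 4-dim observations, 2 discrete actions.
OBS_DIM, N_ACTIONS = 4, 2

class PolicyNet(nn.Module):
    """Small actor network; PPO would later reuse these weights."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(OBS_DIM, 64), nn.Tanh(),
            nn.Linear(64, N_ACTIONS),
        )

    def forward(self, obs):
        return self.body(obs)  # action logits

# Synthetic stand-in for historical expert trajectories:
# the "expert" just acts on the sign of the first feature.
torch.manual_seed(0)
obs = torch.randn(256, OBS_DIM)
acts = (obs[:, 0] > 0).long()

policy = PolicyNet()
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Behavior-cloning warm start: plain supervised learning
# of expert actions from observations.
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(policy(obs), acts)
    loss.backward()
    opt.step()

# Save the pretrained weights; an online PPO run would load
# this state dict into its actor before collecting rollouts.
torch.save(policy.state_dict(), "bc_warm_start.pt")
```

The open question for me is what happens after this point: whether PPO should just fine-tune from these weights, or whether it needs something extra (e.g. a constraint keeping it near the pretrained policy) to avoid destroying the warm start early in training.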
Hi! I’ve tried this example and found that PPO training leads to a drop in performance rather than an improvement (still better than training from scratch, so the model is being loaded). The more episodes of BC pretraining, the larger the drop. I wonder if this is expected?