I’ve been looking around and am now wondering: would it make sense to combine offline RL with PPO (or another online RL algorithm)?
I ask because in my application it is possible to have some historical trajectory data for particular examples, as well as an appropriate simulation environment for online RL. I was thinking of a sort of “warm start” for the online algorithm using expert knowledge, so to speak.
If the above is possible, what would be a sort of “best practice” for doing so? Any pointers would be much appreciated.
If not, what would be the way to go? Any suggestions?