I’ve been looking around and I’m now wondering whether it would make sense to combine offline RL with PPO (or another online RL algorithm)?
I ask because in my application it is possible to have some historical trajectory data for particular examples, as well as an appropriate simulation environment for online RL. I was thinking of a sort of “warm start” of the online algorithm with expert knowledge, let’s say.
If the above is possible, what would be a sort of “best practice” for doing so? Any pointers would be much appreciated.
If not, what would be the way to go? Any suggestion?
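For reference, here is a minimal sketch of the warm-start idea I mean: behavior-cloning (supervised) pretraining of a policy network on historical (observation, action) pairs, whose weights would then initialize the actor of an online PPO run. Everything here is hypothetical (dimensions, the synthetic "expert" data, the file name); it is just to make the question concrete, not a working PPO setup.

```python
import torch
import torch.nn as nn

# Hypothetical setup: 4-dim observations, 2 discrete actions.
OBS_DIM, N_ACTIONS = 4, 2

class PolicyNet(nn.Module):
    """Small actor network; PPO would later reuse these weights."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(OBS_DIM, 64), nn.Tanh(),
            nn.Linear(64, N_ACTIONS),
        )

    def forward(self, obs):
        return self.body(obs)  # action logits

# Synthetic stand-in for historical expert trajectories:
# the "expert" just acts on the sign of the first feature.
torch.manual_seed(0)
obs = torch.randn(256, OBS_DIM)
acts = (obs[:, 0] > 0).long()

policy = PolicyNet()
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Behavior-cloning warm start: plain supervised learning
# of expert actions from observations.
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(policy(obs), acts)
    loss.backward()
    opt.step()

# Save the pretrained weights; an online PPO run would load
# this state dict into its actor before collecting rollouts.
torch.save(policy.state_dict(), "bc_warm_start.pt")
```

The open question for me is what happens after this point: whether PPO should just fine-tune from these weights, or whether it needs something extra (e.g. a constraint keeping it near the pretrained policy) to avoid destroying the warm start early in training.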
Hi! I’ve tried this example and found that PPO training leads to a drop in performance rather than an improvement (still better than training from scratch, so the model is being loaded). The more episodes of BC pretraining, the larger the drop. I wonder if this is expected?