Hello,
I’m interested in beginning training with behavior cloning. I’ve noticed that I could possibly use the MARWIL trainer, but I’ve already built into my custom environment the ability to emit “expert” actions into the batch’s info dicts, which later, in on_postprocess_trajectory, overwrite the actions the policy actually took (roughly as in the sketch below).
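What I have in mind looks roughly like this (simplified sketch; the "expert_action" info key is just a placeholder from my setup, and the import paths follow the Ray 1.x `ray.rllib.agents` layout):

```python
import numpy as np
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.policy.sample_batch import SampleBatch


class ExpertActionCallbacks(DefaultCallbacks):
    def on_postprocess_trajectory(
        self, *, worker, episode, agent_id, policy_id, policies,
        postprocessed_batch, original_batches, **kwargs
    ):
        # My env writes the expert's action for each step into that step's
        # info dict under "expert_action" (placeholder key).
        infos = postprocessed_batch[SampleBatch.INFOS]
        expert_actions = np.array([info["expert_action"] for info in infos])

        # Overwrite the actions the policy actually took with the expert's.
        postprocessed_batch[SampleBatch.ACTIONS] = expert_actions
```

The callback class is then passed to the trainer config via `"callbacks": ExpertActionCallbacks`.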
Is there any reason this would not work, or would be inefficient in some way? I’d just like others’ opinions before investing heavy training time.
Thanks,
Curt
Hey thanks for the question!
cc @sven1977 @rliaw any ideas here?
I think it doesn’t work well in practice. If you’re running a general off-policy (SAC/DDPG) or on-policy (PPO, IMPALA) algorithm and your replay buffer contains only “expert data”, training usually doesn’t go well. For reference, offline RL papers benchmark against general off-policy algorithms on various fixed datasets, including expert datasets. In this paper, https://arxiv.org/pdf/2006.04779.pdf, they benchmark SAC on various expert datasets and it doesn’t do well.
This is probably due to the loss function: it is inherently different from an imitation-learning loss, and the agent would need to pass through a much more diverse set of states to begin learning at all, let alone to imitate expert behavior.
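Also, since you mentioned MARWIL: if you can write your expert trajectories out as RLlib offline data, MARWIL with `beta: 0.0` reduces to plain behavior cloning. A rough sketch, assuming the Ray 1.x `ray.rllib.agents` layout (the env name and data path are placeholders, and exact config keys vary across Ray versions):

```python
import ray
from ray.rllib.agents.marwil import MARWILTrainer

ray.init()

config = {
    "env": "MyCustomEnv-v0",          # placeholder: your registered custom env
    # Offline JSON episodes, e.g. previously recorded with "output": <dir>
    # on a sampling run (path is a placeholder).
    "input": "/tmp/expert-episodes",
    "beta": 0.0,                      # beta = 0.0 turns MARWIL into vanilla behavior cloning
    "input_evaluation": [],           # skip off-policy estimation when training purely offline
    "framework": "torch",
}

trainer = MARWILTrainer(config=config)
for i in range(100):
    result = trainer.train()
    print(i, result["info"]["learner"])  # loss stats; no live env returns when purely offline
```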
How about incorporating an imitation loss directly into the agent’s training loss and progressively adjusting the weight ratio between the imitation loss and the policy loss? This is called “interactive expert”, I believe. A rough sketch of the idea is below.
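A minimal sketch of the mixed loss, assuming discrete actions, a torch policy whose loss you can customize, and that the expert action for each sampled state is available in the batch (all names and the annealing schedule are placeholders):

```python
import torch.nn.functional as F


def combined_loss(policy_loss, action_logits, expert_actions, train_step,
                  anneal_steps=100_000, start_weight=1.0, end_weight=0.0):
    """Mix an imitation (behavior-cloning) term into the RL policy loss.

    policy_loss:    the algorithm's usual loss (e.g. PPO's surrogate loss).
    action_logits:  policy logits for the sampled states, shape [B, num_actions].
    expert_actions: expert action indices for the same states, shape [B].
    train_step:     global training step, used to anneal the imitation weight.
    """
    # Imitation loss: negative log-likelihood of the expert's actions under
    # the current policy, i.e. a standard cross-entropy / BC loss.
    imitation_loss = F.cross_entropy(action_logits, expert_actions)

    # Linearly anneal the imitation weight from start_weight to end_weight,
    # so training starts close to pure imitation and shifts toward pure RL.
    frac = min(train_step / anneal_steps, 1.0)
    w = start_weight + frac * (end_weight - start_weight)

    return (1.0 - w) * policy_loss + w * imitation_loss
```

In RLlib you could wire something like this in through a custom policy/loss function, but the exact hook depends on the algorithm and Ray version.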
Thanks for the response. One thing I didn’t quite understand: my experiment is on-policy (PPO) but you mention the replay buffer only containing expert data. Since PPO is on-policy, why does that matter? Maybe I just didn’t follow that part precisely.
It seems, then, that the way to go is to switch to an imitation-learning approach instead. Thanks for the assistance!