Hello,
I’m interested in beginning training with behavior cloning. I’ve noticed that I could possibly use the MARWIL trainer, but I’ve already built into my custom environment the ability to emit “expert” actions into the batch’s info dicts, which later, in on_postprocess_trajectory, overwrite the actions the policy actually took (roughly as in the sketch below).
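What I have in mind looks roughly like this (simplified sketch; the "expert_action" info key is just a placeholder from my setup, and the import paths follow the Ray 1.x `ray.rllib.agents` layout):

```python
import numpy as np
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.policy.sample_batch import SampleBatch


class ExpertActionCallbacks(DefaultCallbacks):
    def on_postprocess_trajectory(
        self, *, worker, episode, agent_id, policy_id, policies,
        postprocessed_batch, original_batches, **kwargs
    ):
        # My env writes the expert's action for each step into that step's
        # info dict under "expert_action" (placeholder key).
        infos = postprocessed_batch[SampleBatch.INFOS]
        expert_actions = np.array([info["expert_action"] for info in infos])

        # Overwrite the actions the policy actually took with the expert's.
        postprocessed_batch[SampleBatch.ACTIONS] = expert_actions
```

The callback class is then passed to the trainer config via `"callbacks": ExpertActionCallbacks`.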
Is there any reason this would not work, or would be inefficient in some way? I’d just like others’ opinions before investing heavy training time.
Thanks,
Curt
Hey thanks for the question!
cc @sven1977 @rliaw any ideas here?
I think it doesn’t work well in practice. If you’re running a general off-policy (SAC/DDPG) or on-policy (PPO, IMPALA) algorithm and your replay buffer contains only “expert data”, training usually doesn’t go well. For reference, offline RL papers benchmark against general off-policy algorithms on various fixed datasets, including expert datasets. In this paper, https://arxiv.org/pdf/2006.04779.pdf, they benchmark SAC on various expert datasets and it doesn’t do well.
This is probably due to the loss function: it is inherently different from an imitation-learning loss, and the agent would need to pass through a much more diverse set of states to begin learning at all, let alone to imitate expert behavior.
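Also, since you mentioned MARWIL: if you can write your expert trajectories out as RLlib offline data, MARWIL with `beta: 0.0` reduces to plain behavior cloning. A rough sketch, assuming the Ray 1.x `ray.rllib.agents` layout (the env name and data path are placeholders, and exact config keys vary across Ray versions):

```python
import ray
from ray.rllib.agents.marwil import MARWILTrainer

ray.init()

config = {
    "env": "MyCustomEnv-v0",          # placeholder: your registered custom env
    # Offline JSON episodes, e.g. previously recorded with "output": <dir>
    # on a sampling run (path is a placeholder).
    "input": "/tmp/expert-episodes",
    "beta": 0.0,                      # beta = 0.0 turns MARWIL into vanilla behavior cloning
    "input_evaluation": [],           # skip off-policy estimation when training purely offline
    "framework": "torch",
}

trainer = MARWILTrainer(config=config)
for i in range(100):
    result = trainer.train()
    print(i, result["info"]["learner"])  # loss stats; no live env returns when purely offline
```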
How about incorporating an imitation loss directly into the agent’s training loss and progressively adjusting the weight ratio between the imitation loss and the policy loss? This is called “interactive expert”, I believe. A rough sketch of the idea is below.
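A minimal sketch of the mixed loss, assuming discrete actions, a torch policy whose loss you can customize, and that the expert action for each sampled state is available in the batch (all names and the annealing schedule are placeholders):

```python
import torch.nn.functional as F


def combined_loss(policy_loss, action_logits, expert_actions, train_step,
                  anneal_steps=100_000, start_weight=1.0, end_weight=0.0):
    """Mix an imitation (behavior-cloning) term into the RL policy loss.

    policy_loss:    the algorithm's usual loss (e.g. PPO's surrogate loss).
    action_logits:  policy logits for the sampled states, shape [B, num_actions].
    expert_actions: expert action indices for the same states, shape [B].
    train_step:     global training step, used to anneal the imitation weight.
    """
    # Imitation loss: negative log-likelihood of the expert's actions under
    # the current policy, i.e. a standard cross-entropy / BC loss.
    imitation_loss = F.cross_entropy(action_logits, expert_actions)

    # Linearly anneal the imitation weight from start_weight to end_weight,
    # so training starts close to pure imitation and shifts toward pure RL.
    frac = min(train_step / anneal_steps, 1.0)
    w = start_weight + frac * (end_weight - start_weight)

    return (1.0 - w) * policy_loss + w * imitation_loss
```

In RLlib you could wire something like this in through a custom policy/loss function, but the exact hook depends on the algorithm and Ray version.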
Thanks for the response. One thing I didn’t quite understand: my experiment is on-policy (PPO) but you mention the replay buffer only containing expert data. Since PPO is on-policy, why does that matter? Maybe I just didn’t follow that part precisely.
It seems, then, that the way to go is to switch to an imitation-learning approach instead. Thanks for the assistance!