Okay, that looks fast. However, I'm trying to do what you said, and when I call trainer.evaluate() I get the following error:
ValueError: Cannot evaluate w/o an evaluation worker set in the Trainer or w/o an env on the local worker!
Try one of the following:
1) Set `evaluation_interval` >= 0 to force creating a separate evaluation worker set.
2) Set `create_env_on_driver=True` to force the local (non-eval) worker to have an environment to evaluate on.
I tried to set trainer.config['create_env_on_driver'] = True before calling evaluate(), but it doesn't change anything. And I guess point 1) isn't related to my case, as I'm not training.
config['evaluation_interval'] is None and evaluation_duration is not even present in the config. If I try to set config['evaluation_duration'] to some random value, I get
I was setting config['create_env_on_driver'] = True before calling evaluate(), but only after loading the trainer, so I guess it didn't have any effect.
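Right, the Trainer takes a copy of the config when it is constructed, so the flag has to be in the config dict before the trainer is built/restored. A minimal sketch of what that could look like, assuming the Trainer-API era of RLlib, a PPO trainer, and placeholder env name / checkpoint path:

import ray
from ray.rllib.agents import ppo  # assuming PPO; adjust for your algorithm

ray.init()

config = ppo.DEFAULT_CONFIG.copy()
config["env"] = "CartPole-v1"          # placeholder env, use your own
# These flags only take effect if they are set *before* the Trainer is
# constructed; mutating trainer.config afterwards does nothing.
config["create_env_on_driver"] = True  # give the local worker an env to evaluate on
# config["evaluation_interval"] = 1    # alternative: create a separate evaluation worker set

trainer = ppo.PPOTrainer(config=config)
trainer.restore("/path/to/checkpoint")  # placeholder checkpoint path
results = trainer.evaluate()
print(results["evaluation"]["episode_reward_mean"])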
One small comment: my understanding is that if you trained with exploration/stochastic actions, then you can only expect your policy to produce its best actions when exploration is also on during evaluation.
Theory aside, I have tested this on my own policies and environments, and it has consistently been the case for me that performance deteriorates if the explore setting differs between training and testing. YMMV.
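If you want to make that choice explicit rather than relying on defaults, exploration during evaluation can be pinned via the evaluation overrides. A sketch along the same lines as the config above (same assumptions, env name is a placeholder):

from ray.rllib.agents import ppo  # same Trainer-API era of RLlib as above (assumption)

config = ppo.DEFAULT_CONFIG.copy()
config["env"] = "CartPole-v1"          # placeholder env
config["create_env_on_driver"] = True
# Keys in `evaluation_config` are merged on top of the main config for
# evaluation only, so the exploration behaviour can be pinned here.
config["evaluation_config"] = {
    "explore": True,  # keep stochastic actions, matching training; False forces greedy actions
}

trainer = ppo.PPOTrainer(config=config)
# trainer.evaluate() will now sample actions the same way training did.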