I am currently working on a custom implementation of the MBPO algorithm that uses model-generated experience to update a Soft Actor-Critic policy. To get started, I have looked into the implementation of Ray-MBMPO, which is quite helpful.
MBMPO wraps the learned world model with a `model_vector_env` (option `custom_vector_env` in the config) to generate fake samples.
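For reference, this is how I understand the wiring (a minimal sketch only; the import paths and `MBMPOTrainer` usage are my assumptions for Ray ~1.x and may differ between versions):

```python
# Sketch only -- import paths are from Ray ~1.x and may vary across versions.
from ray.rllib.agents.mbmpo import MBMPOTrainer
from ray.rllib.env.wrappers.model_vector_env import model_vector_env

config = {
    # Real env used to fit the dynamics ensemble (MBMPO additionally
    # requires the env to expose a reward() function).
    "env": "HalfCheetah-v2",
    # MBMPO's default: rollouts step the learned model instead of the simulator.
    "custom_vector_env": model_vector_env,
}
trainer = MBMPOTrainer(config=config)
```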
Question #1: am I correct that the `custom_vector_env` option ensures that `rollouts = from_actors(workers.remote_workers())` (Line 438) collects fake samples using the world model? If so, how is the real experience gathered?
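To make the question concrete, this is the pattern I am referring to (a sketch of my reading of the execution plan, not verbatim source):

```python
from ray.util.iter import from_actors

# Each remote RolloutWorker steps its own vector env; since that env is the
# model_vector_env wrapper here, I assume the yielded SampleBatches contain
# model-generated ("fake") transitions.
rollouts = from_actors(workers.remote_workers())
for batch in rollouts.gather_async():
    print(batch.count)  # size of the SampleBatch coming back from a worker
    break
```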
Question #2: is there a convenient way to collect rollouts from both the real environment and the fake (model) environment? Usually, I would collect experience via `rollouts = ParallelRollouts(workers, mode="bulk_sync")`, but that only works for whichever environment is passed via the `env` or `custom_vector_env` option, right? How can I perform rollouts in both the real and the fake environment?
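One direction I have considered (purely a sketch under my assumptions: `real_workers` and `model_workers` would be two separate `WorkerSet`s I construct myself, one with the plain env and one wrapped via `custom_vector_env`; as far as I can tell this is not an existing RLlib feature):

```python
from ray.rllib.execution.rollout_ops import ParallelRollouts
from ray.rllib.execution.concurrency_ops import Concurrently

# Hypothetical: two separate WorkerSets, one per environment type.
real_rollouts = ParallelRollouts(real_workers, mode="bulk_sync")   # real env
fake_rollouts = ParallelRollouts(model_workers, mode="bulk_sync")  # model env

# Interleave the two sample streams round-robin into one iterator.
combined = Concurrently([real_rollouts, fake_rollouts], mode="round_robin")
```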
Question #3: how could I maintain two `ReplayBuffer`s for the real and the fake data, respectively?
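Building on the sketch above, this is roughly what I have in mind (again hedged: `LocalReplayBuffer`, `StoreToReplayBuffer`, and `Replay` are from the Ray ~1.x execution-plan API and have moved/been renamed in later versions; the buffer sizes and fake/real replay ratio are placeholders):

```python
from ray.rllib.execution.replay_buffer import LocalReplayBuffer
from ray.rllib.execution.replay_ops import StoreToReplayBuffer, Replay
from ray.rllib.execution.concurrency_ops import Concurrently

# One buffer per data source (sizes are placeholders).
real_buffer = LocalReplayBuffer(
    num_shards=1, learning_starts=1000,
    buffer_size=100_000, replay_batch_size=256)
fake_buffer = LocalReplayBuffer(
    num_shards=1, learning_starts=1000,
    buffer_size=1_000_000, replay_batch_size=256)

# Route each rollout stream into its own buffer.
store_real = real_rollouts.for_each(StoreToReplayBuffer(local_buffer=real_buffer))
store_fake = fake_rollouts.for_each(StoreToReplayBuffer(local_buffer=fake_buffer))

# Replay from both for the SAC updates, sampling mostly model data.
replay_op = Concurrently(
    [Replay(local_buffer=fake_buffer), Replay(local_buffer=real_buffer)],
    mode="round_robin",
    round_robin_weights=[9, 1],  # placeholder fake:real ratio
)
```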
I would greatly appreciate help with any of the above questions!