MBMPO Questions & Implementing Model-Based Policy Optimization

I am currently working on a custom implementation of the MBPO algorithm that uses model-generated experience to update a Soft Actor-Critic (SAC) policy. To get started, I have looked into Ray's MBMPO implementation, which has been quite helpful.

MBMPO wraps the learned world model in a model_vector_env (the custom_vector_env option in the config) to generate fake samples.
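For reference, this is how I understand the config wiring (a sketch based on my reading of the MBMPO defaults; the import path and env name are my assumptions and may differ between Ray versions):

```python
# Sketch of the config wiring as I understand it (paraphrased, not copied
# from the RLlib source; the import path of model_vector_env is an assumption).
from ray.rllib.agents.mbmpo.model_vector_env import model_vector_env  # assumed path

config = {
    "env": "HalfCheetah-v2",                # real environment for real rollouts
    "custom_vector_env": model_vector_env,  # fake rollouts step the learned model
}
```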

Question #1: am I correct that the custom_vector_env option ensures that rollouts = from_actors(workers.remote_workers()) (Line 438) collects fake samples using the world model? If so, how is the real experience gathered?

Question #2: is there a convenient way to collect rollouts from both the real environment and the fake environment? Usually I would collect experience via rollouts = ParallelRollouts(workers, mode="bulk_sync"), but that only works for the single environment passed via either the env or the custom_vector_env option, right? How can I perform rollouts in both the real and the fake environment? A sketch of what I mean follows below.
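To make this more concrete, here is roughly what I have in mind (a sketch against the execution-plan API; real_workers and fake_workers are hypothetical separate WorkerSets that I would have to create myself, not something MBMPO provides):

```python
from ray.rllib.execution.rollout_ops import ParallelRollouts
from ray.rllib.execution.concurrency_ops import Concurrently

# What I do today for a single environment:
rollouts = ParallelRollouts(workers, mode="bulk_sync")

# What I imagine for real + fake data: two rollout streams, one per
# WorkerSet, interleaved into a single iterator. real_workers would be
# backed by the env option, fake_workers by custom_vector_env.
real_rollouts = ParallelRollouts(real_workers, mode="bulk_sync")
fake_rollouts = ParallelRollouts(fake_workers, mode="bulk_sync")
combined = Concurrently([real_rollouts, fake_rollouts], mode="round_robin")
```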

Question #3: how could I maintain two separate ReplayBuffers, one for the real data and one for the fake data?
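What I currently picture is something like the following (a minimal sketch; SimpleReplayBuffer is my own hypothetical helper, not an RLlib class, and I would rather reuse whatever RLlib provides if there is a clean way):

```python
import random
from collections import deque

from ray.rllib.policy.sample_batch import SampleBatch


class SimpleReplayBuffer:
    """Minimal FIFO buffer over single-timestep SampleBatches (illustrative only)."""

    def __init__(self, capacity: int = 100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, batch: SampleBatch) -> None:
        # Split the rollout batch into single timesteps so sampling can
        # mix transitions from different episodes.
        for row in batch.timeslices(1):
            self.storage.append(row)

    def sample(self, batch_size: int) -> SampleBatch:
        rows = random.sample(list(self.storage), batch_size)
        return SampleBatch.concat_samples(rows)


# One buffer for transitions from the real env, one for model-generated ones.
real_buffer = SimpleReplayBuffer()
fake_buffer = SimpleReplayBuffer()
```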

I would greatly appreciate help with any of the above questions! :slight_smile: