I am currently working on a custom implementation of the MBPO algorithm that uses model-generated experience to update a Soft Actor-Critic policy. To get started, I have looked into the implementation of Ray-MBMPO, which is quite helpful.
MBMPO wraps the learned world model with a `model_vector_env` (option `custom_vector_env` in the config) to generate fake samples.
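For reference, this is how I understand the wiring (a minimal sketch only; the import paths and `MBMPOTrainer` usage are my assumptions for Ray ~1.x and may differ between versions):

```python
# Sketch only -- import paths are from Ray ~1.x and may vary across versions.
from ray.rllib.agents.mbmpo import MBMPOTrainer
from ray.rllib.env.wrappers.model_vector_env import model_vector_env

config = {
    # Real env used to fit the dynamics ensemble (MBMPO additionally
    # requires the env to expose a reward() function).
    "env": "HalfCheetah-v2",
    # MBMPO's default: rollouts step the learned model instead of the simulator.
    "custom_vector_env": model_vector_env,
}
trainer = MBMPOTrainer(config=config)
```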
Question #1: am I correct that the `custom_vector_env` option ensures that `rollouts = from_actors(workers.remote_workers())` (Line 438) collects fake samples using the world model? If so, how is the real experience gathered?
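To make the question concrete, this is the pattern I am referring to (a sketch of my reading of the execution plan, not verbatim source):

```python
from ray.util.iter import from_actors

# Each remote RolloutWorker steps its own vector env; since that env is the
# model_vector_env wrapper here, I assume the yielded SampleBatches contain
# model-generated ("fake") transitions.
rollouts = from_actors(workers.remote_workers())
for batch in rollouts.gather_async():
    print(batch.count)  # size of the SampleBatch coming back from a worker
    break
```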
Question #2: is there a convenient way to collect rollouts from both the real environment and the fake (model) environment? Usually, I would collect experience via `rollouts = ParallelRollouts(workers, mode="bulk_sync")`, but that only works for whichever environment is passed via the `env` or `custom_vector_env` option, right? How can I perform rollouts in both the real and the fake environment?
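One direction I have considered (purely a sketch under my assumptions: `real_workers` and `model_workers` would be two separate `WorkerSet`s I construct myself, one with the plain env and one wrapped via `custom_vector_env`; as far as I can tell this is not an existing RLlib feature):

```python
from ray.rllib.execution.rollout_ops import ParallelRollouts
from ray.rllib.execution.concurrency_ops import Concurrently

# Hypothetical: two separate WorkerSets, one per environment type.
real_rollouts = ParallelRollouts(real_workers, mode="bulk_sync")   # real env
fake_rollouts = ParallelRollouts(model_workers, mode="bulk_sync")  # model env

# Interleave the two sample streams round-robin into one iterator.
combined = Concurrently([real_rollouts, fake_rollouts], mode="round_robin")
```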
Question #3: how could I maintain two `ReplayBuffer`s for the real and the fake data, respectively?
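Building on the sketch above, this is roughly what I have in mind (again hedged: `LocalReplayBuffer`, `StoreToReplayBuffer`, and `Replay` are from the Ray ~1.x execution-plan API and have moved/been renamed in later versions; the buffer sizes and fake/real replay ratio are placeholders):

```python
from ray.rllib.execution.replay_buffer import LocalReplayBuffer
from ray.rllib.execution.replay_ops import StoreToReplayBuffer, Replay
from ray.rllib.execution.concurrency_ops import Concurrently

# One buffer per data source (sizes are placeholders).
real_buffer = LocalReplayBuffer(
    num_shards=1, learning_starts=1000,
    buffer_size=100_000, replay_batch_size=256)
fake_buffer = LocalReplayBuffer(
    num_shards=1, learning_starts=1000,
    buffer_size=1_000_000, replay_batch_size=256)

# Route each rollout stream into its own buffer.
store_real = real_rollouts.for_each(StoreToReplayBuffer(local_buffer=real_buffer))
store_fake = fake_rollouts.for_each(StoreToReplayBuffer(local_buffer=fake_buffer))

# Replay from both for the SAC updates, sampling mostly model data.
replay_op = Concurrently(
    [Replay(local_buffer=fake_buffer), Replay(local_buffer=real_buffer)],
    mode="round_robin",
    round_robin_weights=[9, 1],  # placeholder fake:real ratio
)
```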
I would greatly appreciate help with any of the above questions!