Handling multiple rewards for different branches of a model

For my application, the model interacts with two environments simultaneously. The model starts with a shared encoder and then branches into two actors. Each environment produces its own reward, and each reward is used to train one branch of the network. For every timestep in environment 1, environment 2 will finish one episode (T >= 1). What's the most Ray-ish way of handling this? Thank you.

My current ideas:

  1. I can write a wrapper around these two environments that alternates between them, records which environment each transition came from, and sorts the transitions into two separate batches at learning time. Is this a good idea? Will there be any problems with this implementation?

Does each branch have its own pair of actor-critic?

Yes, they do, and preferably they have their own optimizers as well.

Hi @Aceticia,

Based on what you have said so far this is how I would set it up.

  1. I would create two policies, one for each environment.
  2. I would create a meta-environment that switched between the two environments as needed, making sure the agent_ids were distinct between the two sub-environments.
  3. I would write a policy_mapping_fn that assigned the agents to the appropriate policy.
  4. I would write a custom model that had a shared sub-network, following this example. You can ignore the multi-agent bits if your environment is not multi-agent. The key thing to look at is how it creates a sub-model that is shared by multiple policies.
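The routing in steps 1–3 might look something like this. The agent-id prefixes (`env1_`, `env2_`) and policy names are illustrative choices for the meta-environment, not anything fixed by RLlib:

```python
# Sketch of steps 1-3: two policies plus a mapping function that routes
# each sub-environment's agents to its own policy. The meta-environment
# is assumed to emit agent ids prefixed with "env1_" or "env2_".

def policy_mapping_fn(agent_id, *args, **kwargs):
    # Route agents from sub-environment 1 to policy_env1, the rest to policy_env2.
    return "policy_env1" if agent_id.startswith("env1") else "policy_env2"


multiagent_config = {
    "multiagent": {
        "policies": {
            # Tuples of (policy_cls, obs_space, act_space, config);
            # None means "inherit the trainer's defaults".
            "policy_env1": (None, None, None, {}),
            "policy_env2": (None, None, None, {}),
        },
        "policy_mapping_fn": policy_mapping_fn,
    },
}
```

This config fragment would be merged into the trainer config, so each policy gets its own optimizer state while still being trained from the same rollout worker.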

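For step 4, the weight-sharing pattern from the referenced example can be sketched framework-free: both per-branch models hold a reference to one shared encoder object (a module-level singleton, as in RLlib's shared-weights model example), so gradient updates through either branch modify the same encoder. Class and function names here are illustrative:

```python
# Sketch of the shared sub-network pattern: a module-level singleton
# encoder reused by every branch model, plus a branch-specific head.

_SHARED_ENCODER = None  # module-level singleton holding the shared encoder


def get_shared_encoder(make_encoder):
    """Create the shared encoder on first use, then always return the same object."""
    global _SHARED_ENCODER
    if _SHARED_ENCODER is None:
        _SHARED_ENCODER = make_encoder()
    return _SHARED_ENCODER


class BranchModel:
    """One actor branch: shared encoder followed by a branch-specific head."""

    def __init__(self, make_encoder, head):
        self.encoder = get_shared_encoder(make_encoder)  # shared across branches
        self.head = head  # branch-specific actor/critic head

    def forward(self, obs):
        return self.head(self.encoder(obs))
```

In a real RLlib custom model the encoder would be a torch/tf sub-module created once and referenced by both policies' models, which is exactly what makes the encoder receive gradients from both rewards.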

Good luck!