Handling multiple rewards to different branches of model

Aceticia · September 14, 2021, 2:54am

For my application, the model interacts with two environments simultaneously. The model starts with a shared encoder and then branches into two actors. Each environment gives its own reward, and each reward is used to train one branch of the network. For every timestep in environment 1, environment 2 will finish one full episode (of length T>=1). What’s the most Ray-ish way of handling this? Thank you.

My current ideas:

I can write a wrapper for these two environments and alternate between them. I can record which environment a transition is from, and sort them into 2 different batches when learning. Is this a good idea? Will there be any problems with this implementation?
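To make the idea concrete, here is a minimal sketch of the sorting step only, in plain Python. The `env_id` tag and the transition dict layout are assumptions made for illustration; they are not RLlib data structures.

```python
from collections import defaultdict


def split_by_source(transitions):
    """Group transitions by the (hypothetical) 'env_id' tag set by the wrapper."""
    batches = defaultdict(list)
    for tr in transitions:
        batches[tr["env_id"]].append(tr)
    return dict(batches)


# Interleaved transitions as the alternating wrapper would record them:
# one step from environment 1, then a short episode from environment 2.
transitions = [
    {"env_id": 1, "obs": 0, "action": 1, "reward": 0.5},
    {"env_id": 2, "obs": 0, "action": 0, "reward": 1.0},
    {"env_id": 2, "obs": 1, "action": 1, "reward": 0.0},
    {"env_id": 1, "obs": 1, "action": 0, "reward": 0.2},
]

batches = split_by_source(transitions)
# batches[1] would train the first branch, batches[2] the second.
```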

psxz · September 14, 2021, 12:02pm

Does each branch have its own actor-critic pair?

Aceticia · September 14, 2021, 5:26pm

Yes, they do. Preferably their own optimizers as well.

mannyv · September 15, 2021, 1:11pm

Hi @Aceticia,

Based on what you have said so far this is how I would set it up.

1. I would create two policies, one for each environment.
2. I would create a meta-environment that switches between the two environments as needed, and make sure the agent_ids are distinct between the two sub-environments.
3. I would write a policy_mapping_fn that assigns the agents to the appropriate policy.
4. I would write a custom model that has a shared sub-network, following the example linked below. You can ignore the multi-agent bits if your environment is not multi-agent. The key thing to look at is how it creates a sub-model that is shared by multiple policies.

https://docs.ray.io/en/master/rllib-env.html?highlight=share%20model#variable-sharing-between-policies
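
The meta-environment and mapping function from the steps above could look roughly like the sketch below. It is only an illustration under several assumptions: it does not import ray, it only mimics the dict-based `reset`/`step` convention of RLlib's `MultiAgentEnv` (a real version would subclass that class and define observation and action spaces), and the names `ToyEnv`, `AlternatingMetaEnv`, the agent ids `"slow"` and `"fast_<k>"`, and the policy names are all made up. Giving every fast episode a fresh agent id and holding back the slow agent's reward until its next observation are design choices of this sketch, not something the thread prescribes.

```python
class ToyEnv:
    """Hypothetical stand-in for a real sub-environment (Gym-style API)."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, float(action), done, {}


class AlternatingMetaEnv:
    """One slow-env step, then one full fast-env episode, repeated."""

    def __init__(self, slow_env, fast_env):
        self.slow, self.fast = slow_env, fast_env
        self.fast_episodes = 0
        self.fast_id = None   # id of the currently running fast agent, if any
        self.pending = None   # slow (obs, reward) held until the fast episode ends

    def reset(self):
        self.fast_episodes = 0
        self.fast_id = None
        self.pending = None
        return {"slow": self.slow.reset()}

    def step(self, action_dict):
        if self.fast_id is None:
            # Slow agent's turn.
            obs, rew, done, _ = self.slow.step(action_dict["slow"])
            if done:
                # Slow episode over: end the whole meta-episode.
                return {"slow": obs}, {"slow": rew}, {"slow": True, "__all__": True}, {}
            self.pending = (obs, rew)
            # Fresh agent id per fast episode so a finished agent never reappears.
            self.fast_id = f"fast_{self.fast_episodes}"
            self.fast_episodes += 1
            return {self.fast_id: self.fast.reset()}, {}, {"__all__": False}, {}

        # Fast agent's turn.
        fid = self.fast_id
        obs, rew, done, _ = self.fast.step(action_dict[fid])
        obs_d, rew_d = {fid: obs}, {fid: rew}
        done_d = {fid: done, "__all__": False}
        if done:
            # Fast episode finished: hand control (and the held reward) back.
            obs_d["slow"], rew_d["slow"] = self.pending
            self.fast_id = None
        return obs_d, rew_d, done_d, {}


def policy_mapping_fn(agent_id, *args, **kwargs):
    """Route the slow agent and all fast agents to their own policies."""
    return "slow_policy" if agent_id == "slow" else "fast_policy"
```

Because each policy only ever sees transitions from its own agents, the two rewards stay separated without any manual batch sorting; the shared encoder would still have to come from the custom model described in step 4.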

github.com

ray-project/ray/blob/master/rllib/examples/multi_agent_cartpole.py

"""Simple example of setting up a multi-agent policy mapping.

Control the number of agents and policies via --num-agents and --num-policies.

This works with hundreds of agents and policies, but note that initializing
many TF policies will take some time.

Also, TF evals might slow down with large numbers of policies. To debug TF
execution, set the TF_TIMELINE_DIR environment variable.
"""

import argparse
import os
import random

import ray
from ray import tune
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole
from ray.rllib.examples.models.shared_weights_model import \
    SharedWeightsModel1, SharedWeightsModel2, TF2SharedWeightsModel, \

This file has been truncated.

Good luck!
