I’m interested in creating a multi-agent env with two agents in copies of the same (custom) environment. I’m interested in implementing a different reward for the two agents. Is there an easy way to make this happen?
Also, would this make sense with a centralized critic, or would it mess up the value function for the critic?
When I use a centralized critic in these cases I include an agent index variable that indicates which agent the reward belongs to. If the number of agents is small I use a one-hot encoding, and if there are more than 4 I use a binary encoding.
00000100 ← 6th agent id one-hot encoding
0110 ← 6th agent id binary encoding
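The two encodings above can be sketched as small helpers; this is a minimal illustration (the function names are my own, and I use 0-based agent indices, so the 6th agent is index 5 — note the post's binary example `0110` encodes the 1-based id 6 instead):

```python
import numpy as np

def one_hot_id(agent_idx, num_agents):
    """One-hot encoding of a 0-based agent index."""
    v = np.zeros(num_agents, dtype=np.float32)
    v[agent_idx] = 1.0
    return v

def binary_id(agent_idx, num_bits):
    """Fixed-width binary encoding of a 0-based agent index."""
    bits = [(agent_idx >> i) & 1 for i in reversed(range(num_bits))]
    return np.array(bits, dtype=np.float32)

# 6th agent (0-based index 5) in a team of 8:
one_hot_id(5, 8)  # -> [0, 0, 0, 0, 0, 1, 0, 0]
binary_id(5, 4)   # -> [0, 1, 0, 1]
```

Either vector can then be concatenated onto the centralized critic's observation so the value function knows which agent's reward it is predicting.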
@rsv, just a note: in the example you listed, this dict just returns ints for rewards. Would I return a dict from the reward function so that the rewards can be computed dynamically?
The link below shows a simple multi-agent env example. Your environment's step method will return 4 dicts: one each for the new observations, the rewards, the per-agent (and whole-episode) dones, and the extra info.
In the multi-agent case your environment returns a reward dict. RLlib doesn't calculate rewards itself, so you can compute them with any function inside the environment, and the values may be int or float.
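Putting the thread together, here is a minimal sketch of a two-agent env whose step method returns the four dicts and gives each agent a different reward. The class name, agent ids, and reward formulas are all made up for illustration; a real env would subclass RLlib's `MultiAgentEnv` and define observation/action spaces:

```python
# Sketch of the step() contract in RLlib's multi-agent API:
# step() returns four dicts keyed by agent id (obs, rewards, dones, infos),
# with the special "__all__" key in dones marking the end of the episode.
class TwoAgentEnvSketch:
    def __init__(self):
        self.agents = ["agent_0", "agent_1"]
        self.t = 0

    def reset(self):
        self.t = 0
        return {aid: 0.0 for aid in self.agents}  # obs dict

    def step(self, action_dict):
        self.t += 1
        obs = {aid: float(self.t) for aid in self.agents}
        # Different reward functions per agent: the env computes them,
        # RLlib just routes each entry to the matching policy.
        rewards = {
            "agent_0": 1.0,                 # e.g. plain task reward
            "agent_1": 1.0 - 0.1 * self.t,  # e.g. time-penalized variant
        }
        done = self.t >= 5
        dones = {aid: done for aid in self.agents}
        dones["__all__"] = done             # episode ends for everyone
        infos = {aid: {} for aid in self.agents}
        return obs, rewards, dones, infos
```

Because the rewards are just dict entries computed inside step, giving the two agents different reward functions is only a matter of writing different expressions for each key.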