How severely does this issue affect your experience of using Ray?
Medium: It causes significant difficulty in completing my task, but I can work around it.
Hello!
I have designed a cooperative environment and I am having trouble assigning rewards to agents so as to reinforce cooperative behaviors.
The rewards I am assigning are separate for each agent using the rewards dictionary.
More specifically, I have a time penalty for each step an agent takes, a small positive reward when an agent takes a “correct” action, and, finally, when the team reaches its goal, I give both agents a +1 reward and end the episode.
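For reference, here is a minimal sketch of how I build the rewards dict; the constants and the correct_actions / reached_goal inputs are placeholders for my actual environment logic:

```python
# A minimal sketch of the per-agent reward dict described above; the constants
# and the correct_actions / reached_goal inputs are placeholders.
TIME_PENALTY = -0.01          # placeholder value
CORRECT_ACTION_BONUS = 0.05   # placeholder value
TEAM_GOAL_REWARD = 1.0

def compute_rewards(agent_ids, correct_actions, reached_goal):
    """Build the {agent_id: reward} dict returned from MultiAgentEnv.step().

    correct_actions: dict mapping agent_id -> bool ("took a correct action")
    reached_goal: True when the team reached its goal on this step
    """
    rewards = {}
    for agent_id in agent_ids:
        reward = TIME_PENALTY
        if correct_actions.get(agent_id, False):
            reward += CORRECT_ACTION_BONUS
        if reached_goal:
            # Shared terminal reward; the episode is also ended via the dones dict.
            reward += TEAM_GOAL_REWARD
        rewards[agent_id] = reward
    return rewards
```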
After training for at least 1M steps, although the agents learn the policy and reach their goals, they don’t actually cooperate with each other.
Looking at the episode_reward_mean reported at the end of the episodes (which shows the total reward across all agents of the team rather than the team’s mean reward), I suspect I am not assigning rewards in a way that reinforces cooperative behaviors.
Is there a specific way to assign rewards to reinforce cooperative behaviors?
Or in other words, how do you assign rewards so that the agents maximize their overall reward?
Tune will report the sum of rewards as the overall reward and will use it to score trials for checkpointing and so on. This should not be confused with the way your policies are optimized, which is dependent on how you shape your reward. In your case, it is up to you to design your environment in a way that rewards collaboration. If you want agents to “help” each other, the rewards of the “beneficiary agent” must be included in the rewards of the “helping agent”.
A simple way to do this, if you only have two agents, is to postprocess episodes so that each agent’s reward also includes, for example, a fraction of the other agent’s reward.
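For example, a rough sketch using RLlib’s on_postprocess_trajectory callback hook, assuming two agents that both step on every timestep (the mixing fraction and class name are just placeholders):

```python
import numpy as np
from ray.rllib.algorithms.callbacks import DefaultCallbacks
# (on older Ray versions the import is ray.rllib.agents.callbacks.DefaultCallbacks)

MIX_FRACTION = 0.25  # example value: fraction of the other agent's reward to add

class ShareRewardCallbacks(DefaultCallbacks):
    def on_postprocess_trajectory(
        self, *, worker, episode, agent_id, policy_id, policies,
        postprocessed_batch, original_batches, **kwargs,
    ):
        # original_batches maps agent_id -> (policy, SampleBatch).
        for other_id, (_, other_batch) in original_batches.items():
            if other_id == agent_id:
                continue
            other_rewards = np.asarray(other_batch["rewards"])
            n = min(len(postprocessed_batch["rewards"]), len(other_rewards))
            # Mix a fraction of the other agent's rewards into this agent's.
            postprocessed_batch["rewards"][:n] += MIX_FRACTION * other_rewards[:n]
            # Caveat: for PPO the advantages were already computed from the
            # original rewards by this point, so either recompute them here or
            # do the mixing inside the environment instead.

# Enable with: config["callbacks"] = ShareRewardCallbacks
```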
I see, thanks!
I was worried I had to use something like a cumulative reward for both agents.
@rusu24edward
Both agents share the same policy.
It is basically a field-coverage environment, where two agents start from the bottom-right corner, each with a specific area of coverage.
On top of the rewards I mentioned, they take a penalty depending on how much their areas of effect overlap, but it seems that they are learning slowly.
Well, if by cumulative reward you mean rewarding both agents with the sum of their rewards, that’s one version where the other agent’s objective is “worth” just as much. Maybe you can scale this weight, similar to a learning-rate schedule, so that agents “mainly” learn to maximize their own rewards first before fine-tuning to help the other agent as well. It very much depends on your environment!
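As a sketch of what I mean by scaling, you could anneal the mixing weight over training, loosely like a learning-rate schedule (the horizon and final value below are arbitrary):

```python
# Sketch: ramp the reward-mixing weight from 0 up to MIX_END over ANNEAL_STEPS
# timesteps, so each agent first learns to maximize its own reward and only
# later starts "caring" about the other agent. Numbers are arbitrary examples.
ANNEAL_STEPS = 2_000_000
MIX_END = 0.5

def mixing_weight(global_timestep: int) -> float:
    return MIX_END * min(global_timestep / ANNEAL_STEPS, 1.0)
```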
I am only giving them the extra +1 reward if they succeed (or -1 if they fail).
I will definitely try what you suggested.
As for the scaling, that’s basically what I am doing through a curriculum learning approach.
Initially, it’s just about covering the area (with a larger area of effect).
In later stages, the area of effect gets smaller and the agents take the negative reward when their areas overlap.
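Roughly, the stages look like this (a simplified sketch; the radii, flags, and attribute names are placeholders for my actual settings):

```python
# Simplified sketch of the curriculum stages (values are placeholders).
CURRICULUM = [
    # (area-of-effect radius, overlap penalty enabled)
    (8, False),  # stage 0: large area, just learn to cover the field
    (5, False),  # stage 1: smaller area of effect
    (3, True),   # stage 2: small area, penalize overlapping coverage
]

def apply_stage(env, stage: int):
    radius, penalize_overlap = CURRICULUM[stage]
    env.coverage_radius = radius             # hypothetical env attribute
    env.penalize_overlap = penalize_overlap  # hypothetical env attribute
```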
Are the other agents included in each other’s observations?
Are you adding an agent id to the observations to help the policy know which agent it is selecting actions for?
What algorithm are you using? If it is an actor-critic algorithm, are you using a centralized critic that determines the value based on the joint observations of all the agents?
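For instance, a minimal sketch of attaching a one-hot agent id to each agent’s observation via a Dict space (names here are illustrative, not from your code):

```python
import numpy as np
from gymnasium import spaces  # on older Ray versions: from gym import spaces

NUM_AGENTS = 2

def make_obs_space(base_space):
    # Wrap the per-agent observation space so it also carries a one-hot agent
    # id, letting a shared policy tell the agents apart.
    return spaces.Dict({
        "obs": base_space,
        "agent_id": spaces.Box(0.0, 1.0, shape=(NUM_AGENTS,), dtype=np.float32),
    })

def add_agent_id(obs, agent_index):
    one_hot = np.zeros(NUM_AGENTS, dtype=np.float32)
    one_hot[agent_index] = 1.0
    return {"obs": obs, "agent_id": one_hot}
```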
I am including the other agents’ positions, and I return the observations as a dictionary with the agent ids as keys and the observations as values.
I’ve been using PPO.
I recently tried SAC, A3C, and IMPALA, and I ran into some out-of-memory problems with those.
A (probably) important detail about the approach is that I am using an 84x84x4 binary image as an observation, with each channel containing the following information (see the sketch after this list):
Remaining area to be covered
The obstacles that exist in the environment
The current agent’s position
The other agent’s position
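In code, the per-agent observation space looks roughly like this (a sketch; how I pack the channels is omitted):

```python
import numpy as np
from gymnasium import spaces  # on older Ray versions: from gym import spaces

# 84x84 binary image with 4 channels: remaining area, obstacles,
# own position, other agent's position.
OBS_SPACE = spaces.Box(low=0.0, high=1.0, shape=(84, 84, 4), dtype=np.float32)
```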
I am assuming that this is why I am having the OOM issues with the algorithms mentioned before. The following config values are most likely the cause of the OOM, but I am not sure whether they should be lower. The values are “inspired” by other examples I saw.