MultiAgentEnv Delayed rewards

1. Severity of the issue: (select one)
[O] None: I’m just curious or want clarification.

2. Environment:

  • Ray version: latest
  • Python version: 3.10
  • OS: ubuntu

Hello. Regarding MultiAgentEnv: how is a reward used in training if it is not given in the same step as the agent's observation?

In the RLlib multi-agent env documentation, it says:

Note that the rule of observation dicts determining the exact order of agent moves doesn’t equally apply to either reward dicts nor termination/truncation dicts

And in the example for tic-tac-toe:

# Final reward is +5 for victory and -5 for a loss.
rewards[self.current_player] += 5.0
rewards[opponent] = -5.0

The reward for the opponent is given, even though it did not act this step.

So how is this reward applied to the training data? Can I give multiple rewards after a single observation? How are they aggregated? This is not made clear in the documentation.

You don't have to line rewards up with the exact step an agent acts. RLlib keeps a small reward inbox (one per agent). Any reward you send, even if that agent didn't move this turn, goes into its inbox. When the agent acts again (or when the episode ends), RLlib empties the inbox, sums everything in it, and writes that total into the training batch for that agent.
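
A minimal sketch of that inbox bookkeeping in plain Python (just a mental model to illustrate the accumulation, not RLlib's actual internals; all names here are made up):

from collections import defaultdict

# Toy model of per-agent reward caching (not RLlib's real implementation).
inbox = defaultdict(float)  # pending rewards, keyed by agent ID

def receive_rewards(reward_dict):
    # Rewards may arrive for agents that did not act this step.
    for agent_id, r in reward_dict.items():
        inbox[agent_id] += r

def flush_on_next_act(agent_id):
    # When the agent acts again (or the episode ends), the accumulated
    # total is attached to that agent's next timestep in the training batch.
    return inbox.pop(agent_id, 0.0)

# Player A wins: both rewards are sent on the same env step.
receive_rewards({"player_a": 5.0, "player_b": -5.0})
print(flush_on_next_act("player_b"))  # -5.0, credited at B's final step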

So tl;dr, yes, you can give a reward to an agent on a step where it had no observation.

I think they might mention it a bit further down in the doc you linked: Multi-Agent Environments — Ray 2.46.0

So what I think is happening in the tic-tac-toe example: when player A wins, you send +5 to A and -5 to B immediately; B's reward sits in its inbox until B's next turn (which, in a finished game, is the final step). Let me know if that made sense :slight_smile:
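
For concreteness, here is a stripped-down turn-based env along those lines (a sketch, not the actual tic-tac-toe example from the docs; the spaces, episode length, and agent IDs are all made up, and the exact space/agent attributes may differ across Ray versions, so check the multi-agent docs for yours):

import gymnasium as gym
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class ToyTurnBasedEnv(MultiAgentEnv):
    # Two agents alternate turns; the final step rewards BOTH agents,
    # even though only one of them acted on that step.

    def __init__(self, config=None):
        super().__init__()
        self.agents = self.possible_agents = ["player_a", "player_b"]
        self.observation_spaces = {a: gym.spaces.Discrete(10) for a in self.agents}
        self.action_spaces = {a: gym.spaces.Discrete(2) for a in self.agents}
        self.t = 0

    def reset(self, *, seed=None, options=None):
        self.t = 0
        return {"player_a": 0}, {}  # player_a moves first

    def step(self, action_dict):
        self.t += 1
        if self.t < 5:
            # Only the next mover gets an observation (and hence acts next).
            mover = self.agents[self.t % 2]
            return {mover: self.t}, {}, {"__all__": False}, {"__all__": False}, {}
        # Game over: reward both agents in the same step. player_b did not
        # act here, so its -5.0 is cached and credited to its final timestep.
        rewards = {"player_a": 5.0, "player_b": -5.0}
        terminateds = {"player_a": True, "player_b": True, "__all__": True}
        return {a: self.t for a in self.agents}, rewards, terminateds, {"__all__": False}, {}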

Hello. Thank you for your time!

Yes, your explanation made sense to me! But I have some additional questions:

  1. Can multiple rewards be given for a single observation? E.g. at t=0 an observation is given, at t=1 a reward is given, at t=2 another reward is given. Is this possible? If so, how are the rewards matched to the observation?

A: From your answer, these will be summed and matched with the observation at t=0. Got it.
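
In timeline form, reusing the toy inbox sketch from earlier in the thread (again just a mental model, not RLlib internals):

# t=0: agent gets an observation and acts.
receive_rewards({"agent_0": 1.0})  # t=1: first reward arrives
receive_rewards({"agent_0": 2.0})  # t=2: second reward arrives
# t=3: the agent acts again (or the episode ends) -> inbox is flushed:
assert flush_on_next_act("agent_0") == 3.0  # total credited to the t=0 step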

  2. What will happen if two observations are given sequentially without any reward in between? E.g. t=0 → obs, t=1 → obs, and the episode terminates.