MultiAgentEnv Delayed rewards

1. Severity of the issue: (select one)
[O] None: I’m just curious or want clarification.

2. Environment:

  • Ray version: latest
  • Python version: 3.10
  • OS: ubuntu

Hello. Regarding MultiAgentEnv: how is a reward used in training if it is not given in the same step as the agent's observation?

In the RLlib multi-agent env documentation, it says:

Note that the rule of observation dicts determining the exact order of agent moves doesn’t equally apply to either reward dicts nor termination/truncation dicts

And in the example for tic-tac-toe:

# Final reward is +5 for victory and -5 for a loss.
rewards[self.current_player] += 5.0
rewards[opponent] = -5.0

The reward for the opponent is given, even though it did not act this step.

So how is this reward applied to the training data? Can I give multiple rewards after a single observation? How are they aggregated? This is not made clear in the documentation.

You don't have to line rewards up with the exact step an agent acts. RLlib keeps a small reward inbox (one per agent). Any reward you send, even if that agent didn't move this turn, goes into its inbox. When the agent acts again (or when the episode ends), RLlib empties the inbox, sums everything in it, and writes that total into the training batch for that agent.
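
A minimal sketch of that inbox bookkeeping in plain Python (just a mental model to illustrate the accumulation, not RLlib's actual internals; all names here are made up):

from collections import defaultdict

# Toy model of per-agent reward caching (not RLlib's real implementation).
inbox = defaultdict(float)  # pending rewards, keyed by agent ID

def receive_rewards(reward_dict):
    # Rewards may arrive for agents that did not act this step.
    for agent_id, r in reward_dict.items():
        inbox[agent_id] += r

def flush_on_next_act(agent_id):
    # When the agent acts again (or the episode ends), the accumulated
    # total is attached to that agent's next timestep in the training batch.
    return inbox.pop(agent_id, 0.0)

# Player A wins: both rewards are sent on the same env step.
receive_rewards({"player_a": 5.0, "player_b": -5.0})
print(flush_on_next_act("player_b"))  # -5.0, credited at B's final step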

So tl;dr, yes, you can give a reward to an agent on a step where it had no observation.

I think they might mention it a bit further down in the doc you linked: Multi-Agent Environments — Ray 2.46.0

So what I think is happening in the tic-tac-toe example: when player A wins, you send +5 to A and -5 to B immediately; B's reward sits in its inbox until B's next turn (which, in a finished game, is the final step). Let me know if that made sense :slight_smile:
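
For concreteness, here is a stripped-down turn-based env along those lines (a sketch, not the actual tic-tac-toe example from the docs; the spaces, episode length, and agent IDs are all made up, and the exact space/agent attributes may differ across Ray versions, so check the multi-agent docs for yours):

import gymnasium as gym
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class ToyTurnBasedEnv(MultiAgentEnv):
    # Two agents alternate turns; the final step rewards BOTH agents,
    # even though only one of them acted on that step.

    def __init__(self, config=None):
        super().__init__()
        self.agents = self.possible_agents = ["player_a", "player_b"]
        self.observation_spaces = {a: gym.spaces.Discrete(10) for a in self.agents}
        self.action_spaces = {a: gym.spaces.Discrete(2) for a in self.agents}
        self.t = 0

    def reset(self, *, seed=None, options=None):
        self.t = 0
        return {"player_a": 0}, {}  # player_a moves first

    def step(self, action_dict):
        self.t += 1
        if self.t < 5:
            # Only the next mover gets an observation (and hence acts next).
            mover = self.agents[self.t % 2]
            return {mover: self.t}, {}, {"__all__": False}, {"__all__": False}, {}
        # Game over: reward both agents in the same step. player_b did not
        # act here, so its -5.0 is cached and credited to its final timestep.
        rewards = {"player_a": 5.0, "player_b": -5.0}
        terminateds = {"player_a": True, "player_b": True, "__all__": True}
        return {a: self.t for a in self.agents}, rewards, terminateds, {"__all__": False}, {}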

Hello. Thank you for your time!

Yes, your explanation made sense to me! But I have some additional questions:

  1. Can multiple rewards be given for a single observation? E.g. at t=0 an observation is given, at t=1 a reward is given, at t=2 another reward is given. Is this possible? If so, how are the rewards matched to the observation?

A: From your answer, these will be summed and matched with the observation at t=0. Got it.
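
In timeline form, reusing the toy inbox sketch from earlier in the thread (again just a mental model, not RLlib internals):

# t=0: agent gets an observation and acts.
receive_rewards({"agent_0": 1.0})  # t=1: first reward arrives
receive_rewards({"agent_0": 2.0})  # t=2: second reward arrives
# t=3: the agent acts again (or the episode ends) -> inbox is flushed:
assert flush_on_next_act("agent_0") == 3.0  # total credited to the t=0 step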

  2. What will happen if two observations are given sequentially without any reward in between? E.g. t=0 → obs, t=1 → obs, and the episode terminates.