Delayed assignment of rewards or punishments in single-agent RL

Hi folks!

I know this forum isn’t the most appropriate place to discuss the following issue, but maybe some of you still have suggestions. Recommendations for better places to post such general RL questions are also welcome.

At first I thought a multi-agent RL approach would be necessary for the use case I’m working on. But after reconsidering, I think a single-agent approach could also work, might be easier to implement, and is perhaps more intuitive.

But I’m not sure about the consequences of one aspect that comes with the change to single-agent, namely: in most steps there is no immediate assignment of rewards or punishments. Only very few steps allow direct feedback, so in many situations the feedback to the single agent is neutral (i.e. reward = 0). Since an assessment can often only be made quite a few steps later, the agent might have taken further actions in the meantime. This means that, for example, a punishment is assigned to a state-action pair (s, a) that isn’t “responsible” for that punishment; the punishment is independent of the most recently taken action.
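To make this concrete, here is a toy sketch of what I mean (plain Python, not my real environment; the class name, the delay of 5 steps, and the +1/-1 payouts are purely illustrative assumptions):

```python
from collections import deque
import random

class DelayedRewardToyEnv:
    """Toy environment: the reward caused by an action only arrives
    REWARD_DELAY steps later, so a naive assignment would credit it
    to whatever action happened to be taken most recently."""

    REWARD_DELAY = 5  # arbitrary delay, just for illustration

    def reset(self):
        self.t = 0
        self.pending = deque()   # (due_step, reward) pairs
        return 0                 # dummy observation

    def step(self, action):
        self.t += 1
        # The effect of `action` is only evaluated REWARD_DELAY steps later.
        delayed_r = 1.0 if action == 1 else -1.0
        self.pending.append((self.t + self.REWARD_DELAY, delayed_r))

        # Pay out whatever became due at this step; otherwise reward is 0.0.
        reward = 0.0
        while self.pending and self.pending[0][0] <= self.t:
            reward += self.pending.popleft()[1]

        done = self.t >= 20
        return self.t, reward, done, {}

# Rollout with a random policy to show the sparse/delayed feedback.
env = DelayedRewardToyEnv()
obs, done = env.reset(), False
while not done:
    obs, reward, done, _ = env.step(random.choice([0, 1]))
    print(obs, reward)
```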

What do you think: is such a delayed and misattributed assignment problematic for learning? Does it distort the agent’s perception and make learning the task impossible?

Hi @klausk55,

This is known as the credit assignment problem. Here is a conference workshop from a few years back that might be helpful as a place to start.


Hi @klausk55, yes, what @mannyv said :slightly_smiling_face:

RL should be able to deal with these delayed or even “misleading” rewards. Think about AlphaGo and how it was able to learn with only a simple +1 or -1 reward at the very end of the game (all other rewards were 0.0!). However, learning is of course not guaranteed here, and shaping your reward function to make it richer (reward != 0.0 more often) and more direct (feedback immediately following good/bad actions) almost always has a beneficial effect on learning.
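One generic way to make the reward denser without changing the optimal policy is potential-based reward shaping (Ng et al., 1999). Rough sketch below; the `potential` function and the `distance_to_goal` field are made-up placeholders you would replace with something meaningful for your task:

```python
GAMMA = 0.99  # discount factor, assumed to match the agent's

def potential(state):
    """Hypothetical potential: any heuristic estimate of how 'promising'
    a state is, e.g. negative distance to the goal."""
    return -abs(state.get("distance_to_goal", 0.0))

def shaped_reward(raw_reward, state, next_state):
    """Adds gamma * phi(s') - phi(s) on top of the sparse environment
    reward. This form of shaping leaves the optimal policy unchanged
    while giving the agent much more frequent feedback."""
    shaping = GAMMA * potential(next_state) - potential(state)
    return raw_reward + shaping
```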


Thanks @mannyv and @sven1977 for your feedback!

Now I also believe that single-agent RL should be able to deal with it, especially if I use algorithms like PPO with a neural network that includes an LSTM cell. Finding an appropriate (hyper-)parameter config also seems essential for learning success.
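For reference, this is roughly the kind of config I have in mind (old-style RLlib config dict; exact key names may differ between RLlib versions, and the environment and all values are just placeholders):

```python
from ray import tune

config = {
    "env": "CartPole-v0",              # placeholder env; substitute your own
    "framework": "torch",
    "gamma": 0.99,
    "lambda": 0.95,                    # GAE lambda
    "rollout_fragment_length": 200,
    "train_batch_size": 4000,
    "model": {
        "use_lstm": True,              # wrap the policy net with an LSTM cell
        "lstm_cell_size": 64,
        "max_seq_len": 20,             # BPTT length for the LSTM
    },
}

tune.run("PPO", config=config, stop={"training_iteration": 100})
```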

Btw: The maximal horizon length used for generalized advantage estimation (GAE) corresponds to rollout_fragment_length in the context of RLlib, right?


Awesome! That’s correct @klausk55: once the rollout_fragment_length is reached by an agent, we have to bootstrap the last “reward” with the value-function estimate, unless the episode is done, in which case we can use 0.0. Over that fragment we then do the GAE calculation.
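Not RLlib’s actual implementation, but the principle over one fragment looks roughly like this (simplified sketch assuming the fragment contains at most an episode end at its last step; hyperparameter values are placeholders):

```python
import numpy as np

def gae_advantages(rewards, values, last_value, done, gamma=0.99, lam=0.95):
    """GAE over one rollout fragment.

    `values` are V(s_t) for each step in the fragment; `last_value` is the
    value estimate of the state following the fragment. If the episode
    terminated at the end of the fragment, we bootstrap with 0.0 instead.
    """
    bootstrap = 0.0 if done else last_value
    values_ext = np.append(values, bootstrap)
    # TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]

    # Backward pass: A_t = delta_t + gamma * lambda * A_{t+1}
    advantages = np.zeros_like(deltas)
    gae = 0.0
    for t in reversed(range(len(deltas))):
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example: a truncated (not done) fragment must bootstrap from last_value.
r = np.array([0.0, 0.0, 1.0, 0.0])
v = np.array([0.5, 0.6, 0.7, 0.4])
print(gae_advantages(r, v, last_value=0.3, done=False))
```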