Delayed assignment of rewards or punishments in single-agent RL

Hi folks!

I know this forum isn’t the most appropriate place to discuss the following issue, but maybe some of you have suggestions anyway. Recommendations for forums better suited to such general RL questions are also welcome.

At first I thought a multi-agent RL approach would be necessary for the use case I’m working on. But after reconsidering, I think a single-agent approach could also work; it might be easier to implement and perhaps more intuitive.

But I’m not sure about the consequences of one aspect that comes with the change to single-agent: in many steps there is no immediate assignment of rewards or punishments. Only a few steps allow direct feedback, so in many situations the feedback to the agent is neutral (i.e., reward = 0). Since an assessment can often be made only several steps later, the agent may have taken further actions in the meantime. This means that, for example, a punishment gets assigned to a state-action pair (s, a) that isn’t “responsible” for it: the punishment is independent of the most recently taken action.

What do you think: is such a delayed and misattributed assignment problematic for learning? Does it distort the agent’s perception and make learning the task impossible?

Hi @klausk55,

This is known as the credit assignment problem. Here is a conference workshop from a few years back that might be a helpful place to start.


Hi @klausk55, yes, what @mannyv said :slightly_smiling_face:

RL should be able to deal with such delayed or even “misleading” rewards. Think of AlphaGo and how it was able to learn with only a single +1 or -1 reward at the very end of the game (all other rewards were 0.0!). That said, learning is not guaranteed here, and shaping your reward function to make it richer (reward != 0.0 more often) and more direct (feedback directly following good/bad actions) almost always has a beneficial effect on learning.
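A quick sketch (plain NumPy, made-up numbers) of why a single sparse reward at the end of an episode still reaches earlier steps: the discounted return propagates it backward, so every earlier state-action pair gets a non-zero learning signal.

```python
import numpy as np

# Toy episode: 5 steps, reward only at the very end (AlphaGo-style).
rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
gamma = 0.99

# Discounted returns G_t = r_t + gamma * G_{t+1}, computed backward:
# the terminal +1 flows back as gamma^k to every earlier step.
returns = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

print(returns)  # earliest step receives gamma^4 * 1.0 ≈ 0.96
```

Of course, the further the relevant action lies from the reward (and the smaller gamma), the weaker this signal gets, which is exactly why richer reward shaping helps.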


Thanks @mannyv and @sven1977 for your feedback!

Now I also believe that single-agent RL should be able to deal with it, especially if I use algorithms like PPO with a neural network that includes an LSTM cell. Finding an appropriate (hyper)parameter configuration also seems essential for learning success.
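For reference, enabling such an LSTM in RLlib is mostly a model-config change. A sketch of a PPO config (the `model` keys `use_lstm`, `max_seq_len`, and `lstm_cell_size` come from RLlib's model catalog; the env name here is a hypothetical placeholder, and the exact trainer API depends on your Ray version):

```python
# Hypothetical PPO config for a delayed-reward env with an LSTM wrapper.
config = {
    "env": "MyDelayedRewardEnv",     # placeholder: your registered env
    "gamma": 0.99,                   # discount factor
    "lambda": 0.95,                  # GAE lambda
    "rollout_fragment_length": 200,  # fragment size used for GAE
    "model": {
        "use_lstm": True,            # wrap the default net with an LSTM cell
        "max_seq_len": 20,           # truncated-BPTT sequence length
        "lstm_cell_size": 256,       # hidden state size
    },
}

# Pass this to your PPO trainer, e.g. via tune.run("PPO", config=config).
```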

Btw: the maximal horizon length used for generalized advantage estimation (GAE) corresponds to rollout_fragment_length in the context of RLlib, right?

Awesome! That’s correct @klausk55: once rollout_fragment_length steps have been collected for an agent, we use the value-function estimate of the last state to bootstrap the return, unless the episode is done, in which case we can use 0.0. GAE is then computed over that fragment.