I know this forum isn’t the most appropriate place to discuss the following issue, but maybe some of you still have suggestions. Recommendations for forums better suited to such general RL questions are also welcome.
At first I thought a multi-agent RL approach would be necessary for the use case I’m working on. But after reconsidering, I think a single-agent approach could also work; it might be easier to implement and is perhaps more intuitive.
However, I’m not sure about the consequences of one aspect that comes with the switch to single-agent: in most steps there is no immediate reward or punishment. Only a few steps allow direct feedback, so in many situations the agent receives neutral feedback (i.e. reward = 0). Since an assessment can often only be made quite a few steps later, the agent may have taken further actions in the meantime. For me this means that, for example, a punishment is assigned to a state-action pair (s, a) that isn’t actually “responsible” for it; the punishment is independent of the most recently taken action.
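To make this concrete, here is a minimal Python sketch of the situation (the episode, state names, and discount factor are all made up for illustration, not from my actual environment). With a naive one-step view the punishment lands on the most recent action; my understanding is that the discounted return G_t = r_t + gamma * G_{t+1} is meant to spread such delayed feedback back over the whole episode:

```python
gamma = 0.99  # discount factor (arbitrary illustrative value)

# Hypothetical episode: the first action causes a punishment that only
# arrives several steps later, after further (neutral) actions were taken.
episode = [
    # (state, action, immediate reward)
    ("s0", "a_bad",     0.0),   # the action actually responsible
    ("s1", "a_neutral", 0.0),
    ("s2", "a_neutral", 0.0),
    ("s3", "a_neutral", -1.0),  # the punishment lands here instead
]

# Naive one-step view: blame goes only to the pair observed together with
# the reward, i.e. ("s3", "a_neutral") rather than ("s0", "a_bad").
for s, a, r in episode:
    if r != 0.0:
        print(f"one-step blame falls on ({s}, {a})")

# Discounted return, computed backwards over the episode:
# the delayed punishment is propagated to every earlier state-action pair.
g = 0.0
returns = []
for _, _, r in reversed(episode):
    g = r + gamma * g
    returns.append(g)
returns.reverse()

for (s, a, _), g_t in zip(episode, returns):
    print(f"return credited to ({s}, {a}): {g_t:.4f}")
```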
So, what do you think: is such a delayed and misattributed assignment problematic for learning? Does it distort the agent’s perception and make the task impossible to learn?