How does the agent know which action the reward comes from?

Let’s take a simple game where you shoot many times and only get a reward later, based on how accurately you hit the target (like firing bombs that fly for many hours). You shoot 10 times consecutively and only then get the reward value (+1 or -1). Or another example: you play Go (like AlphaGo) and your move only takes effect in the distant future.

So, how does the algorithm understand which action was good and which was not? Can I clarify it and help the agent understand more? For example, in real life, when I launch a bomb, I can see on the map which bomb exploded successfully.
Can I assign a reward to a particular action in the past? The gym loop only returns the current step's reward (state, reward, done, info = env.step(action)), so I guess it does not know exactly which past action impacts this reward; it just generalizes. Is that correct?
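To make it concrete, here is a rough sketch of the loop I mean (ShootingEnv and agent are hypothetical placeholders; every intermediate reward is 0 and only the last step returns +1 or -1):

env = ShootingEnv()   # hypothetical env: reward stays 0 until the final step
state = env.reset()
done = False
rewards = []

while not done:
    action = agent.choose_action(state)           # one shot per step, hypothetical agent
    state, reward, done, info = env.step(action)  # gym only gives me the current step's reward
    rewards.append(reward)

print(rewards)  # e.g. [0, 0, 0, 0, 0, 0, 0, 0, 0, 1] - only the last value says anything about accuracy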

No, it doesn’t know the exact impact; it estimates it roughly via the discount factor. The estimate doesn’t need to be accurate, it just needs to be good enough for the algorithm to learn from many trials. The problem you are describing is called the credit assignment problem, in case you want to read more about it.
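For intuition, here is a minimal sketch in plain Python (not tied to any particular library) of how a single terminal reward is spread back over earlier steps through discounted returns:

# Episode with a sparse terminal reward: 10 steps, reward only at the end.
rewards = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
gamma = 0.9  # discount factor

# Compute the discounted return G_t for every step, working backwards from the end.
returns = []
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns.reverse()

print(returns)  # roughly [0.387, 0.430, ..., 0.9, 1.0]: earlier actions get smaller, discounted credit

Every action in the episode gets some share of the final reward, just discounted more the further it is from the end; over many trials this rough credit is good enough for learning.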

@sirjay

Can I clarify it and help the agent understand more?

Yes, you can. You can use the reward shaping method. For example, if you want to win a chess match, you can define small positive rewards for actions whose effects are positively correlated with winning, and small penalties (negative rewards) for actions whose effects are negatively correlated with winning.

For example:
If after your action you lose the queen: a small penalty, -0.09
If after your action the opponent loses the queen: a small reward, +0.09
For a rook: +/- 0.05
For a bishop or knight: +/- 0.03
For a pawn: +/- 0.01

You can define the values so that the sum over all pieces is smaller than 1, because your goal is to win the match, not to eliminate all of your opponent's pieces.
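A minimal sketch of what such a shaping term could look like (the piece values and the captured-piece lists are hypothetical, just to illustrate the idea; the values of one side's pieces sum to 0.39 < 1):

# Hypothetical reward-shaping term for a chess-like environment.
# Values are scaled so one side's total material stays below the final +/-1 win reward.
PIECE_VALUES = {"queen": 0.09, "rook": 0.05, "bishop": 0.03, "knight": 0.03, "pawn": 0.01}

def shaping_reward(pieces_i_captured, pieces_i_lost):
    """Small bonus for material the opponent loses, small penalty for material I lose."""
    reward = 0.0
    for piece in pieces_i_captured:
        reward += PIECE_VALUES[piece]   # opponent loses a piece -> small positive reward
    for piece in pieces_i_lost:
        reward -= PIECE_VALUES[piece]   # I lose a piece -> small penalty
    return reward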

After some number of iterations you can save the trained model and run this pretrained model again with sparse rewards only: +1 if you win and -1 if you lose.

It should help to train the model, because the agent understands faster which moves are correlated with the final reward.

After some number of iterations you can save the trained model and run this pretrained model again with sparse rewards only: +1 if you win and -1 if you lose.

Something like this? Is the last reward value of the episode really the most important piece of information?

def step(self, action):
    obs = self.next_observation()
    reward = self.take_action(action)  # shaped per-step reward
    done = self.is_done()

    if done:
        # Last step of the episode: override the reward
        # with the sparse win/lose signal.
        if self.is_good_game():
            reward = 1
        else:
            reward = -1

    return obs, reward, done, {}

@sirjay I thought about this solution:

  1. Train the model with modified rewards
  2. Save the model to files
  3. Load the pretrained model from the files and use the environment with sparse rewards

How to save and load RLlib files:
Ray RLlib: How to Save a Trained Agent for Later Use
Ray RLlib: How to Use and Record a Saved Agent
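A rough sketch of that workflow with RLlib's Trainer API (the ChessEnv class and the sparse_rewards flag are hypothetical, and the exact imports and checkpoint handling depend on your Ray version):

import ray
from ray.rllib.agents.ppo import PPOTrainer  # older RLlib API; newer versions use ray.rllib.algorithms

ray.init()

# Phase 1: train on the environment with shaped (modified) rewards.
trainer = PPOTrainer(
    env=ChessEnv,  # hypothetical custom gym env that reads env_config in its constructor
    config={"env_config": {"sparse_rewards": False}},
)
for _ in range(100):
    trainer.train()
checkpoint_path = trainer.save()  # save the trained model to files

# Phase 2: same env class configured for sparse rewards only,
# starting from the pretrained weights.
trainer_sparse = PPOTrainer(
    env=ChessEnv,
    config={"env_config": {"sparse_rewards": True}},
)
trainer_sparse.restore(checkpoint_path)  # load the pretrained model
for _ in range(100):
    trainer_sparse.train()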