How does the agent know which action the reward comes from?

Let’s take a simple game where you shoot many times and only get a reward later, based on how accurately you hit the target (like firing bombs that fly for many hours). You shoot 10 times consecutively and only then get the reward value (+1 or -1). Or another example: you play Go (like AlphaGo) and your move only takes effect in the distant future.

So, how does the algorithm understand which action was good and which was not? Can I clarify it and help the agent understand more? For example, in real life, when I launch a bomb, I can see on the map which bomb exploded successfully.
Can I assign a reward to a particular action in the past? The gym loop only returns the current step's reward (state, reward, done, info = env.step(action)), so I guess it does not know exactly which past action impacts this reward; it just generalizes. Is that correct?
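To make it concrete, here is a rough sketch of the loop I mean (ShootingEnv and agent are hypothetical placeholders; every intermediate reward is 0 and only the last step returns +1 or -1):

env = ShootingEnv()   # hypothetical env: reward stays 0 until the final step
state = env.reset()
done = False
rewards = []

while not done:
    action = agent.choose_action(state)           # one shot per step, hypothetical agent
    state, reward, done, info = env.step(action)  # gym only gives me the current step's reward
    rewards.append(reward)

print(rewards)  # e.g. [0, 0, 0, 0, 0, 0, 0, 0, 0, 1] - only the last value says anything about accuracy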

No, it doesn’t know the exact impact; it estimates it roughly via the discount factor. The estimate doesn’t need to be accurate, it just needs to be good enough for the algorithm to learn from many trials. The problem you are describing is called the credit assignment problem, in case you want to read more about it.
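For intuition, here is a minimal sketch in plain Python (not tied to any particular library) of how a single terminal reward is spread back over earlier steps through discounted returns:

# Episode with a sparse terminal reward: 10 steps, reward only at the end.
rewards = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
gamma = 0.9  # discount factor

# Compute the discounted return G_t for every step, working backwards from the end.
returns = []
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns.reverse()

print(returns)  # roughly [0.387, 0.430, ..., 0.9, 1.0]: earlier actions get smaller, discounted credit

Every action in the episode gets some share of the final reward, just discounted more the further it is from the end; over many trials this rough credit is good enough for learning.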

@sirjay

Can I clarify it and help the agent understand more?

Yes, you can. You can use the reward shaping method. For example, if you want to win a chess match, you can define small positive rewards for actions whose effects are positively correlated with winning, and small penalties (negative rewards) for actions whose effects are negatively correlated with winning.

For example:
If after your action you lose the queen: a small penalty, -0.09
If after your action the opponent loses the queen: a small reward, +0.09
For a rook: +/- 0.05
For a bishop or knight: +/- 0.03
For a pawn: +/- 0.01

You can define the values so that the sum over all pieces is smaller than 1, because your goal is to win the match, not to eliminate all of your opponent's pieces.
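A minimal sketch of what such a shaping term could look like (the piece values and the captured-piece lists are hypothetical, just to illustrate the idea; the values of one side's pieces sum to 0.39 < 1):

# Hypothetical reward-shaping term for a chess-like environment.
# Values are scaled so one side's total material stays below the final +/-1 win reward.
PIECE_VALUES = {"queen": 0.09, "rook": 0.05, "bishop": 0.03, "knight": 0.03, "pawn": 0.01}

def shaping_reward(pieces_i_captured, pieces_i_lost):
    """Small bonus for material the opponent loses, small penalty for material I lose."""
    reward = 0.0
    for piece in pieces_i_captured:
        reward += PIECE_VALUES[piece]   # opponent loses a piece -> small positive reward
    for piece in pieces_i_lost:
        reward -= PIECE_VALUES[piece]   # I lose a piece -> small penalty
    return reward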

After some number of iterations you can save the trained model and run this pretrained model again with sparse rewards only: +1 if you win and -1 if you lose.

It should help to train the model, because the agent understands faster which moves are correlated with the final reward.

After some number of iterations you can save the trained model and run this pretrained model again with sparse rewards only: +1 if you win and -1 if you lose.

Something like this? Is the last reward value of the episode really the most important piece of information?

def step(self, action):
    obs = self.next_observation()
    reward = self.take_action(action)  # shaped per-step reward
    done = self.is_done()

    if done:
        # Last step of the episode: override the reward
        # with the sparse win/lose signal.
        if self.is_good_game():
            reward = 1
        else:
            reward = -1

    return obs, reward, done, {}

@sirjay I thought about this solution:

  1. Train the model with modified rewards
  2. Save the model to files
  3. Load the pretrained model from the files and use the environment with sparse rewards

How to save and load RLlib files:
Ray RLlib: How to Save a Trained Agent for Later Use
Ray RLlib: How to Use and Record a Saved Agent
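A rough sketch of that workflow with RLlib's Trainer API (the ChessEnv class and the sparse_rewards flag are hypothetical, and the exact imports and checkpoint handling depend on your Ray version):

import ray
from ray.rllib.agents.ppo import PPOTrainer  # older RLlib API; newer versions use ray.rllib.algorithms

ray.init()

# Phase 1: train on the environment with shaped (modified) rewards.
trainer = PPOTrainer(
    env=ChessEnv,  # hypothetical custom gym env that reads env_config in its constructor
    config={"env_config": {"sparse_rewards": False}},
)
for _ in range(100):
    trainer.train()
checkpoint_path = trainer.save()  # save the trained model to files

# Phase 2: same env class configured for sparse rewards only,
# starting from the pretrained weights.
trainer_sparse = PPOTrainer(
    env=ChessEnv,
    config={"env_config": {"sparse_rewards": True}},
)
trainer_sparse.restore(checkpoint_path)  # load the pretrained model
for _ in range(100):
    trainer_sparse.train()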