Maximum recommended reward

evo11x · July 5, 2022, 7:21pm

What is the maximum reward per step that should be used?
I have seen some environments with rewards as high as 100. Should I only use rewards between 0 and 1?

Medium: It contributes to significant difficulty to complete my task, but I can work

Peter_Pirog · July 6, 2022, 9:02pm

@evo11x There is no strict rule for the best reward value. What matters is the relative relationship between the lowest and the highest value. The reward tells the algorithm how good or how bad a result is. Typically, if you want to reduce the number of useless steps, you can define a negative penalty, for example to avoid an infinite number of steps when you try to solve a labyrinth.
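The labyrinth penalty mentioned above could be sketched roughly as follows. The function names and the values `-0.01` and `1.0` are illustrative assumptions, not something taken from a particular environment.

```python
def maze_reward(reached_goal: bool,
                step_penalty: float = -0.01,   # assumed small cost per step
                goal_reward: float = 1.0) -> float:  # assumed bonus at the exit
    """Per-step reward for a hypothetical labyrinth task."""
    return goal_reward if reached_goal else step_penalty


def episode_return(num_steps: int) -> float:
    """Undiscounted return of an episode that reaches the goal on its last step."""
    return sum(
        maze_reward(reached_goal=(t == num_steps - 1))
        for t in range(num_steps)
    )


# A shorter path to the exit collects fewer penalties and ends with a higher return.
print(episode_return(10) > episode_return(50))
```

Only the difference between the two returns matters here: the penalty makes wandering strictly worse than heading for the exit.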

Personally, I understood the idea of rewards when, as a kid, I played the game Majesty Kingdom Simulator. In this game you can’t move your knights directly, but you can define rewards for tasks such as killing the dragon.
For example:

If task 1 has a reward of +100 gold and task 2 has a reward of +100 gold, your knight chooses the nearest task
If task 1 has a reward of +100 gold and task 2 has a reward of +200 gold, your knight chooses task 2
If you set rewards of +100 gold for both tasks or +300 for both tasks, the effect is similar because the rewards for task 1 and task 2 are equal, so there is no reason to prefer either task

I recommend the game Majesty to understand and get a feel for the idea of rewards and indirect control of a process
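The knight example can be mimicked with a few lines of Python. This is only a sketch: modelling "nearest" with a discount factor `gamma` and a distance in steps is my assumption, not how the game actually decides.

```python
from typing import Dict, Tuple


def discounted_value(reward: float, steps_away: int, gamma: float = 0.99) -> float:
    """Value of a reward that is only collected after `steps_away` steps."""
    return (gamma ** steps_away) * reward


def preferred_task(tasks: Dict[str, Tuple[float, int]], gamma: float = 0.99) -> str:
    """Return the task name with the highest discounted value.

    `tasks` maps a name to (reward in gold, distance in steps).
    """
    return max(tasks, key=lambda name: discounted_value(*tasks[name], gamma))


# Equal rewards: the nearer task wins, whether both pay 100 or both pay 300.
print(preferred_task({"task_1": (100, 5), "task_2": (100, 20)}))
print(preferred_task({"task_1": (300, 5), "task_2": (300, 20)}))
# Doubling one reward changes the ranking even though that task is farther away.
print(preferred_task({"task_1": (100, 5), "task_2": (200, 20)}))
```

Multiplying all rewards by the same positive factor leaves the ranking unchanged, which is why the relative values matter more than the absolute scale.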

evo11x · July 6, 2022, 9:23pm

Thanks! I know the idea of using rewards to guide the learning. But I have a problem: the learning stalls after about 2-3 million steps. I have rewards up to 100 and I thought that this was the problem.

Peter_Pirog · July 6, 2022, 9:28pm

Can you say which environment you want to solve and how you defined the rewards?

avnishn · July 6, 2022, 9:35pm

@Peter_Pirog has some good advice here, but I’ll add a little more detail.

Generally speaking, non-sparse reward functions that are smooth and differentiable are often the best reward functions for your problem. Neural networks typically do well with small linear reward ranges, so a range between 0 and 100 should be fine as long as the sequence of actions in an optimal episode maps to a smoothly increasing reward function that doesn’t have extremely large derivatives at any point.
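One rough way to check a reward sequence for large jumps is sketched below. `max_relative_jump` is a hypothetical helper written for this thread, not part of RLlib, and what counts as "too large" is left to the reader.

```python
from typing import Sequence


def max_relative_jump(rewards: Sequence[float]) -> float:
    """Largest step-to-step reward change, as a fraction of the reward range.

    Values near 1.0 mean that a single step accounts for almost the whole
    range (a spiky, close-to-sparse signal); small values mean the reward
    changes gradually over the episode.
    """
    if len(rewards) < 2:
        return 0.0
    span = max(rewards) - min(rewards)
    if span == 0:
        return 0.0
    jumps = [abs(b - a) for a, b in zip(rewards, rewards[1:])]
    return max(jumps) / span


smooth = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # gradual increase
spiky = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 100]            # one big jump at the end

print(max_relative_jump(smooth))
print(max_relative_jump(spiky))
```

Both sequences cover the same 0 to 100 range, so the range alone does not tell them apart; the size of the individual jumps does.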

evo11x · July 6, 2022, 9:51pm

@Peter_Pirog I have a complex environment with only 2 actions, but it needs a complex strategy. The agent is able to solve only the simple moves and is unable to learn more advanced moves.

@avnishn Thanks for the tip! I think I have larger changes in the rewards and the reward function is not that smooth.

Peter_Pirog · July 6, 2022, 9:56pm

Ideas:

Maybe your model is not complex enough. Try to use more layers and more neurons.
Try a smaller learning rate. Maybe the lr is too big and there is a problem with convergence.

evo11x · July 6, 2022, 11:38pm

@Peter_Pirog I have tried layers up to [1024,1024,1024] and it did not help very much.

I found that adding 4-5 layers of 256 neurons helps a lot with learning (not sure whether this is too much?).
I have now changed the rewards to have lower deviations between steps, and it seems that the learning has improved.

Peter_Pirog · July 7, 2022, 4:03am

Do you have plots of the training process to analyze convergence and mean reward?
You need plots for training and evaluation.

Which algorithm do you use?
Have you tried ray.tune to find the best hyperparameters?

evo11x · July 7, 2022, 10:25am

I use APPO, and I have only the reward plots from Ray. What kind of plots are you talking about?
No, I don’t use ray.tune yet, because I am still struggling to make it learn.
The reward min and mean are going down instead of up, the maximum is going up, and the episode length is not reaching the maximum of 100 because the actor chooses to die instead of learning. If I increase the “game over” penalty, it chooses to do almost nothing in order to maintain a fixed reward.
I need the reward mean to be above 0 in order to have a good result.

I am thinking of using offline data to get it moving in the right direction, that’s why I asked this question here

Lars_Simon_Zehnder · July 13, 2022, 6:34am

@evo11x from what I read here - especially from your last reply - it sounds as if your agent is not exploring enough. It has learned to collect some rewards with some rather unsophisticated moves. It keeps doing so because it does not know better. To know better it needs to explore more, and this only happens if it either acts more randomly or gets rewarded for exploration itself.

I would try either increasing the stochasticity of actions or trying out some of RLlib’s more sophisticated exploration algorithms, like Curiosity or Random Encoder.

evo11x · July 13, 2022, 5:37pm

Thanks @Lars_Simon_Zehnder!
I have tried to add an exploration config, but I get this error:
ValueError: Only (Multi)Discrete action spaces supported for Curiosity so far!

How can I increase the stochasticity of the actions?

Lars_Simon_Zehnder · July 13, 2022, 9:16pm

@evo11x, if you are working with continuous actions, Curiosity cannot be used. Try the Random Encoder instead; it works with continuous actions. I am also working on further exploration algorithms that might come with the 2.0.0 or a later release.

For more stochasticity in the actions you can also increase the exploration probability in epsilon greedy (via initial_epsilon and final_epsilon in the exploration_config) or increase the random_timesteps parameter when using stochastic sampling. Try some exploration modules and see what your agent does. Look into the states and actions to understand what its problem might really be.

You can write out your trial data by using the Offline API: add "output": "path/to/your/data/directory" to your config and study some of the data. Maybe your environment also has a rendering function that you can use.
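As a rough illustration, the two exploration settings mentioned above could look like this as config fragments. The numeric values are placeholders I made up, and `epsilon_timesteps` is included on the assumption that the installed RLlib version accepts it for EpsilonGreedy; please check the exploration docs for your version. As far as I know, EpsilonGreedy is normally paired with Q-learning style algorithms and discrete actions, while policy-gradient algorithms such as APPO use StochasticSampling.

```python
# Stochastic sampling: act fully at random for the first `random_timesteps`
# steps, then sample from the policy's action distribution.
stochastic_sampling_config = {
    "exploration_config": {
        "type": "StochasticSampling",
        "random_timesteps": 10_000,  # placeholder value
    }
}

# Epsilon greedy: take a random action with probability epsilon, which is
# annealed from `initial_epsilon` down to `final_epsilon`.
epsilon_greedy_config = {
    "exploration_config": {
        "type": "EpsilonGreedy",
        "initial_epsilon": 1.0,        # placeholder value
        "final_epsilon": 0.05,         # placeholder value
        "epsilon_timesteps": 100_000,  # placeholder annealing horizon
    }
}

print(stochastic_sampling_config["exploration_config"]["type"])
print(epsilon_greedy_config["exploration_config"]["type"])
```

Either fragment would be merged into the algorithm config that is passed to the trainer.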

evo11x · July 14, 2022, 7:42am

@Lars_Simon_Zehnder thanks!
But if I increase random_timesteps, can it affect the prediction (when I restore the training and use compute_single_action)? I don’t want too much randomness to end up in the trained model’s predictions.
I have tried changing the rewards, then restoring and re-training, and it seems to help.

Lars_Simon_Zehnder · July 14, 2022, 7:24pm

@evo11x the prediction depends on what your agent has learned. The main tasks are to search and to learn, in other words to find out what works really well and then repeat it. If your agent has not even found the moves that work well, whatever it has learned won’t result in much return. In turn, the predictions later on will probably not lead to a good result. The result gets better when a better balance between exploration and exploitation is found.

You are right: at the other extreme, if you use too much exploration the agent cannot learn good strategies, as good experiences might never be learned. Even worse, there could be catastrophic forgetting. It’s all about a good balance.

You can test for this by making several runs with different values of the hyperparameter random_timesteps (or, when using epsilon greedy, of initial_epsilon and final_epsilon) and observing how your agent learns in each run. You can do this easily with Ray’s hyperparameter tuning engine, tune.
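The shape of such a sweep can be sketched in plain Python. This does not call Ray at all; in Tune you would let `tune.grid_search` expand the values for you. `sweep_random_timesteps` and the example values are hypothetical.

```python
from copy import deepcopy
from typing import Any, Dict, List


def sweep_random_timesteps(base_config: Dict[str, Any],
                           values: List[int]) -> List[Dict[str, Any]]:
    """Build one config per `random_timesteps` value without touching the base."""
    configs = []
    for value in values:
        cfg = deepcopy(base_config)
        exploration = cfg.setdefault("exploration_config", {})
        exploration["type"] = "StochasticSampling"
        exploration["random_timesteps"] = value
        configs.append(cfg)
    return configs


base = {"lr": 1e-4}  # placeholder base config
for cfg in sweep_random_timesteps(base, [0, 10_000, 100_000]):
    print(cfg["exploration_config"]["random_timesteps"])
```

Each resulting config corresponds to one run; comparing their reward curves over the same number of timesteps shows whether more initial randomness helps.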

evo11x · July 14, 2022, 7:37pm

Thanks! I will try those parameters.
I am not yet very familiar with tune. I have tried it a few times, but in order to get a good result I need to train the agent for over 100 million steps, which would make tune run forever trying different parameters.

Lars_Simon_Zehnder · July 14, 2022, 7:41pm

Well, I do not know your problem setting, but I use tune myself to also investigate whether different hyperparameters already make a difference early in training … and to avoid waiting 5 times as long for a result when testing 5 different hyperparameter combinations.

evo11x · July 14, 2022, 7:47pm

@Lars_Simon_Zehnder this is how the learning starts for me. I need at least 40 million steps until it reaches a reasonable reward

Lars_Simon_Zehnder · July 14, 2022, 7:50pm

Thanks for the elaboration. I would probably run several runs simultaneously so that I only have to wait once or twice for my results.
These 40 million timesteps might also be reduced by exploration itself, as the agent might more quickly find some moves that work well.
