Oscillating mean reward

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I'm a bit new to reinforcement learning, but I have tried multiple on/off-policy algorithms and configurations in RLlib - PPO, A2C, R2D2. My issue is that the mean reward fluctuates dramatically: it improves a bit, but then “stabilizes” with large oscillations, as if it has not learned much. I cannot figure out whether the problem is in the environment formulation or just the hyperparameters being far off.

I'm basically looking for some general guidelines - if someone could point me in a direction where I can search further, that would help a lot.

Data:
The data is a sample of “games”, where each game always ends after roughly 100 steps (+/- 5).
The data consists of several thousand games.
Rewards occur throughout each episode, but their magnitude (positive or negative) is often greater late in the game.
Rewards are usually larger (positive or negative) when, looking at it in retrospect, there are “a lot” of timesteps in between “correct” actions.

Information is “incomplete” in the sense that one cannot know the final outcome in advance.

There is great variation between games - think of it conceptually as two strong players facing each other vs. a world champion playing a newbie, and everything in between: one side having an excellent start, or things being a stalemate until one side has a breakthrough…

Goal
Maximize the mean reward across all games.

Environment:
Can anything be said in general about the following: should the environment terminate (return done) when a single game ends, and then randomly select a new game to start from within environment.reset()? Or should all games somehow be played in sequence within each iteration?

I assume the first, based on the LunarLander gym environment, which also redefines e.g. the surface and creates a new “layout” on reset.
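
In other words, is the right pattern something like the sketch below? The data in it is a synthetic stand-in since I cannot share the real games, and the per-game obs/reward arrays are placeholders; I'm using gymnasium here - older Ray/RLlib versions expect gym instead.

```python
import numpy as np
import gymnasium as gym  # older Ray/RLlib versions use `gym` instead
from gymnasium import spaces


class GamesEnv(gym.Env):
    """One episode == one game; reset() samples the next game at random."""

    def __init__(self, env_config=None):
        # Placeholder "games": each ~100 +/- 5 steps of (obs, reward) data.
        # In the real env these would be loaded from the recorded games.
        rng = np.random.default_rng(0)
        self.games = [
            {
                "obs": rng.normal(size=(length, 8)).astype(np.float32),
                "rewards": rng.normal(size=length).astype(np.float32),
            }
            for length in rng.integers(95, 106, size=1000)
        ]
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(8,), dtype=np.float32)
        self.action_space = spaces.Discrete(3)  # three mutually exclusive actions
        self.game = None
        self.t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # New episode -> pick a new game at random
        # (like LunarLander regenerating its terrain on reset).
        self.game = self.games[self.np_random.integers(len(self.games))]
        self.t = 0
        return self.game["obs"][self.t], {}

    def step(self, action):
        # Placeholder reward lookup; the real reward depends on `action`.
        reward = float(self.game["rewards"][self.t])
        self.t += 1
        terminated = self.t >= len(self.game["obs"])  # done when this game ends
        obs = self.game["obs"][min(self.t, len(self.game["obs"]) - 1)]
        return obs, reward, terminated, False, {}
```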

There are three actions, and from a human perspective the same action cannot be ideal for two different situations - like betting on the weather: it cannot be both sunny and snowing.

There is always a reward at each timestep - small or large, positive or negative. I would imagine more negatives than positives, but the rewards are not extremely sparse.

Rewards show both stepwise behavior, in the sense that they can jump from e.g. negative to positive between two timesteps, and in general a smoother trend up or down.

I did try:

  • Trying both on- and off-policy algorithms.
  • Playing with the replay buffer for e.g. R2D2, after reading that it helps in other scenarios with great variability.
  • Playing with LSTM & attention for e.g. PPO (see the config sketch after this list).
  • Training for longer, e.g. a couple of thousand iterations.
  • Playing with hyperparameters (kind of blindly…), though I started out with the algorithm defaults.
  • Different feature engineering approaches.
  • Calculating rewards by hand to verify they are correct.
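
For reference, the LSTM/attention experiments used RLlib's standard model flags, roughly like the sketch below (Ray 2.x-style config; illustrative values rather than my exact settings, and parameter names may differ between Ray versions):

```python
from ray.rllib.algorithms.ppo import PPOConfig  # Ray 2.x import path

config = (
    PPOConfig()
    .environment(GamesEnv)  # the custom env sketched above
    .training(
        model={
            "use_lstm": True,  # or "use_attention": True for the attention net
            "lstm_cell_size": 256,
            "max_seq_len": 20,
        },
    )
)
algo = config.build()
result = algo.train()
```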

But no matter what, it seems to end up in the same place: the mean reward oscillates from iteration to iteration, e.g. up and down within say [-30; 11], without converging.

Based on experience, can anything be said about:

  • How to set up the environment to handle individual games?
  • Do some hyperparameters need to be in a specific range - like train_batch_size, sgd_minibatch_size or whatever - to make sense in relation to the “known” steps per game? (See the sketch after this list.)
  • Anything else that, based on experience, is worth a try - algorithms, configs, hyperparameters, reward function, etc.?
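
For example, would it make sense to size the batches as whole multiples of the ~100-step games, along these lines (purely illustrative numbers, Ray 2.x-style config)?

```python
from ray.rllib.algorithms.ppo import PPOConfig

GAME_LEN = 100  # known game length (+/- 5 steps)

config = (
    PPOConfig()
    .environment(GamesEnv)  # the custom env sketched above
    .training(
        train_batch_size=40 * GAME_LEN,   # ~40 complete games per training iteration
        sgd_minibatch_size=2 * GAME_LEN,  # ~2 complete games per SGD minibatch
        num_sgd_iter=10,
        gamma=0.99,
        lambda_=0.95,
    )
)
```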

Anything that could help pinpoint some prime suspect(s) would be great. Anything is welcome :slight_smile:

I know it's extremely broad, but I'm not allowed to share the data :frowning: