Advice needed on learning curve interpretation

Dear community,

my PPO training for a custom environment produced the following outcome. I would appreciate some advice on how to interpret the results and where to start improving. Due to the narrow timeline of the project, I cannot afford to lose energy and time groping in the dark.

Thank you very much for all your support!

High-level info:

  • 60 iterations with 6000 timesteps
  • After 60 iterations, I had 149 episodes (one episode always consists of 40 steps)
  • Learning rate was 0.001 (1e-3)

The graphs were generated with the help of the following code snippet:

# Required imports (not shown in the original snippet)
import numpy as np
import matplotlib.pyplot as plt

# Unpack values from each iteration
# (`results` is the list of per-iteration result dicts collected during training)
rewards = np.hstack([i['hist_stats']['episode_reward'] for i in results])
pol_loss = [
    i['info']['learner']['default_policy']['learner_stats']['policy_loss']
    for i in results]
vf_loss = [
    i['info']['learner']['default_policy']['learner_stats']['vf_loss']
    for i in results]

# Rolling mean and standard deviation of the episode rewards (trailing window, p = 100)
p = 100
mean_rewards = np.array([np.mean(rewards[i - p:i + 1])
                         if i >= p else np.mean(rewards[:i + 1])
                         for i, _ in enumerate(rewards)])
std_rewards = np.array([np.std(rewards[i - p:i + 1])
                        if i >= p else np.std(rewards[:i + 1])
                        for i, _ in enumerate(rewards)])

# Left panel: smoothed episode rewards; right panels: policy and value-function losses
fig = plt.figure(constrained_layout=True, figsize=(20, 10))
gs = fig.add_gridspec(2, 4)
ax0 = fig.add_subplot(gs[:, :-2])
ax0.fill_between(np.arange(len(mean_rewards)),
                 mean_rewards - std_rewards,
                 mean_rewards + std_rewards,
                 label='Standard Deviation', alpha=0.3)
ax0.plot(mean_rewards, label='Mean Rewards')
ax0.set_ylabel('Rewards')
ax0.set_xlabel('Episode')
ax0.set_title('Training Rewards')
ax0.legend()
ax1 = fig.add_subplot(gs[0, 2:])
ax1.plot(pol_loss)
ax1.set_ylabel('Loss')
ax1.set_xlabel('Iteration')
ax1.set_title('Policy Loss')
ax2 = fig.add_subplot(gs[1, 2:])
ax2.plot(vf_loss)
ax2.set_ylabel('Loss')
ax2.set_xlabel('Iteration')
ax2.set_title('Value Function Loss')
plt.show()
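
Side note: the same rolling statistics could also be computed with pandas (assuming it is available), which avoids the manual window bookkeeping:

import pandas as pd

s = pd.Series(rewards)
mean_rewards = s.rolling(window=p + 1, min_periods=1).mean().to_numpy()
std_rewards = s.rolling(window=p + 1, min_periods=1).std(ddof=0).to_numpy()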

Not enough info about the problem to provide specific advice, but in general, for RL, look at the rewards, not the loss curves. I suggest starting with print statements to understand what is being rewarded and by how much, and also the contents of obs and actions at each step. You could have an inadequate reward structure, essential items missing from the obs vector, a defective environment interpreting the actions incorrectly, or simply a set of hyperparams that aren’t working well.
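
For instance, a minimal sketch of that kind of instrumentation, assuming a Gymnasium-style reset()/step() API and a numeric observation vector (adjust to your env’s actual interface):

import numpy as np

def debug_rollout(env, max_steps=40):
    # Roll out one episode with random actions and print what the env returns
    obs, info = env.reset()
    for step in range(max_steps):
        action = env.action_space.sample()   # swap in a scripted action if that is more telling
        obs, reward, terminated, truncated, info = env.step(action)
        print(f"step={step:2d}  action={action}  reward={reward:+.4f}  obs={np.round(obs, 3)}")
        if terminated or truncated:
            break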


Hi @starkj,
thanks for the guidance; this triggered a closer look at the reward function design.

Below you see the reward development plotted for a typical episode. Keep in mind that each episode has a fixed length, e.g. 40 steps in cases like the one below. The sequence of actions taken within an episode is what the agent needs to learn a policy for.

The reward function is currently designed so that, at each timestep, it shows the iterative gain or loss towards the ultimate goal (measured at the end of the episode). Hence, you could call it a dense reward based on continuous values.

Hi @PhilippWillms,

By “iterative gain or loss at each time step”, do you mean a value for taking the current action, or that value plus the sum of all the previous rewards since the beginning of the episode?

If the reward were 1 on every timestep of a 5-step episode, would your rewards be
a: 1, 1, 1, 1, 1

or b: 1, 2, 3, 4, 5

It is rather the form you describe under “b”, but not monotonically increasing.
b = 1, 2, -1, 5, 3, 4
So yes, the “sum of all the previous rewards” is included, but the reward can be negative, indicating a penalty (malus) when the agent takes an action that moves it away from the objective.
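
To make the two forms concrete, here is a toy sketch in plain NumPy: the “b” sequence is the cumulative sum of per-step gains (type “a”), and np.diff recovers the per-step form from it. The per-step values below are simply reconstructed from the example sequence above.

import numpy as np

# Per-step gains/losses (type "a") that produce the "b" sequence above
per_step = np.array([1.0, 1.0, -3.0, 6.0, -2.0, 1.0])

# Type "b": the running total reported as the reward at each step
cumulative = np.cumsum(per_step)            # -> [ 1.  2. -1.  5.  3.  4.]

# Recovering type "a" from type "b"
recovered = np.concatenate(([cumulative[0]], np.diff(cumulative)))
assert np.allclose(recovered, per_step)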

Hi @PhilippWillms,

In my past experiments on the environments I run, I have always had better performance when I structure the reward to follow type a.

You are providing an undiscounted return which will be accumulated again when producing the value targets.

Also, I would recommend making your training batch size a multiple of 40, if that is always the episode length, so that you do not need to bootstrap the return for incomplete episodes in the batch.
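
To make the double-accumulation point concrete, here is a small sketch in plain NumPy (not RLlib internals; reusing the toy numbers from the example above). The return that feeds the value targets is itself a backward accumulation of rewards, so handing it already-accumulated “b”-style rewards counts early gains multiple times:

import numpy as np

def discounted_return(rewards, gamma=0.99):
    # G_t = r_t + gamma * G_{t+1}, computed backwards over the episode
    g, out = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

per_step = np.array([1.0, 1.0, -3.0, 6.0, -2.0, 1.0])    # type "a" rewards
cumulative = np.cumsum(per_step)                          # type "b" rewards

print(discounted_return(per_step))     # targets based on the per-step gains
print(discounted_return(cumulative))   # "b" rewards: early gains are counted repeatedly

(Regarding the batch size: in RLlib this is the train_batch_size setting; the 6000 timesteps mentioned above is already a multiple of 40.)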
