Advice needed on learning curve interpretation

Dear community,

my PPO training for a custom environment produced the following outcome. I would appreciate some advice on how to interpret the results and where to start improving. Due to the narrow timeline of the project, I cannot afford to lose energy and time groping in the dark.

Thank you very much for all your support!

High-level info:

  • 60 iterations with 6000 timesteps
  • After 60 iterations, I had 149 episodes (one episode always consists of 40 steps)
  • Learning rate was 0.001 (1e-3)

The graphs were generated with the help of the following code snippet:

# Required imports (not shown in the original snippet)
import numpy as np
import matplotlib.pyplot as plt

# Unpack values from each iteration
# (`results` is the list of per-iteration result dicts collected during training)
rewards = np.hstack([i['hist_stats']['episode_reward'] for i in results])
pol_loss = [
    i['info']['learner']['default_policy']['learner_stats']['policy_loss']
    for i in results]
vf_loss = [
    i['info']['learner']['default_policy']['learner_stats']['vf_loss']
    for i in results]

# Rolling mean and standard deviation of the episode rewards (trailing window, p = 100)
p = 100
mean_rewards = np.array([np.mean(rewards[i - p:i + 1])
                         if i >= p else np.mean(rewards[:i + 1])
                         for i, _ in enumerate(rewards)])
std_rewards = np.array([np.std(rewards[i - p:i + 1])
                        if i >= p else np.std(rewards[:i + 1])
                        for i, _ in enumerate(rewards)])

# Left panel: smoothed episode rewards; right panels: policy and value-function losses
fig = plt.figure(constrained_layout=True, figsize=(20, 10))
gs = fig.add_gridspec(2, 4)
ax0 = fig.add_subplot(gs[:, :-2])
ax0.fill_between(np.arange(len(mean_rewards)),
                 mean_rewards - std_rewards,
                 mean_rewards + std_rewards,
                 label='Standard Deviation', alpha=0.3)
ax0.plot(mean_rewards, label='Mean Rewards')
ax0.set_ylabel('Rewards')
ax0.set_xlabel('Episode')
ax0.set_title('Training Rewards')
ax0.legend()
ax1 = fig.add_subplot(gs[0, 2:])
ax1.plot(pol_loss)
ax1.set_ylabel('Loss')
ax1.set_xlabel('Iteration')
ax1.set_title('Policy Loss')
ax2 = fig.add_subplot(gs[1, 2:])
ax2.plot(vf_loss)
ax2.set_ylabel('Loss')
ax2.set_xlabel('Iteration')
ax2.set_title('Value Function Loss')
plt.show()
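
Side note: the same rolling statistics could also be computed with pandas (assuming it is available), which avoids the manual window bookkeeping:

import pandas as pd

s = pd.Series(rewards)
mean_rewards = s.rolling(window=p + 1, min_periods=1).mean().to_numpy()
std_rewards = s.rolling(window=p + 1, min_periods=1).std(ddof=0).to_numpy()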

Not enough info about the problem to provide specific advice, but in general, for RL, look at the rewards, not the loss curves. I suggest starting with print statements to understand what is being rewarded and by how much, and also the contents of obs and actions at each step. You could have an inadequate reward structure, essential items missing from the obs vector, a defective environment interpreting the actions incorrectly, or simply a set of hyperparams that aren’t working well.
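
For instance, a minimal sketch of that kind of instrumentation, assuming a Gymnasium-style reset()/step() API and a numeric observation vector (adjust to your env’s actual interface):

import numpy as np

def debug_rollout(env, max_steps=40):
    # Roll out one episode with random actions and print what the env returns
    obs, info = env.reset()
    for step in range(max_steps):
        action = env.action_space.sample()   # swap in a scripted action if that is more telling
        obs, reward, terminated, truncated, info = env.step(action)
        print(f"step={step:2d}  action={action}  reward={reward:+.4f}  obs={np.round(obs, 3)}")
        if terminated or truncated:
            break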


Hi @starkj,
thanks for the guidance; this triggered a closer look at the reward function design.

Below you see the reward development plotted for a typical episode. Keep in mind that each episode has a fixed length, e.g. 40 steps in cases like the one below. The sequence of actions taken within an episode is what the agent needs to learn a policy for.

The reward function is currently designed so that, at each timestep, it shows the iterative gain or loss towards the ultimate goal (measured at the end of the episode). Hence, you could call it a dense reward based on continuous values.

Hi @PhilippWillms,

By “iterative gain or loss at each time step”, do you mean a value for taking the current action, or that value plus the sum of all the previous rewards since the beginning of the episode?

If the reward were 1 on every timestep of a 5-step episode, would your rewards be
a: 1, 1, 1, 1, 1

or b: 1, 2, 3, 4, 5

It is rather the form you describe under “b”, but not monotonically increasing.
b = 1, 2, -1, 5, 3, 4
So yes, the “sum of all the previous rewards” is included, but the reward can be negative, indicating a penalty (malus) when the agent takes an action that moves it away from the objective.
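
To make the two forms concrete, here is a toy sketch in plain NumPy: the “b” sequence is the cumulative sum of per-step gains (type “a”), and np.diff recovers the per-step form from it. The per-step values below are simply reconstructed from the example sequence above.

import numpy as np

# Per-step gains/losses (type "a") that produce the "b" sequence above
per_step = np.array([1.0, 1.0, -3.0, 6.0, -2.0, 1.0])

# Type "b": the running total reported as the reward at each step
cumulative = np.cumsum(per_step)            # -> [ 1.  2. -1.  5.  3.  4.]

# Recovering type "a" from type "b"
recovered = np.concatenate(([cumulative[0]], np.diff(cumulative)))
assert np.allclose(recovered, per_step)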

Hi @PhilippWillms,

In my past experiments on the environments I run, I have always had better performance when I structure the reward to follow type a.

You are providing an undiscounted return which will be accumulated again when producing the value targets.

Also, I would recommend making your training batch size a multiple of 40, if that is always the episode length, so that you do not need to bootstrap the return for incomplete episodes in the batch.
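
To make the double-accumulation point concrete, here is a small sketch in plain NumPy (not RLlib internals; reusing the toy numbers from the example above). The return that feeds the value targets is itself a backward accumulation of rewards, so handing it already-accumulated “b”-style rewards counts early gains multiple times:

import numpy as np

def discounted_return(rewards, gamma=0.99):
    # G_t = r_t + gamma * G_{t+1}, computed backwards over the episode
    g, out = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

per_step = np.array([1.0, 1.0, -3.0, 6.0, -2.0, 1.0])    # type "a" rewards
cumulative = np.cumsum(per_step)                          # type "b" rewards

print(discounted_return(per_step))     # targets based on the per-step gains
print(discounted_return(cumulative))   # "b" rewards: early gains are counted repeatedly

(Regarding the batch size: in RLlib this is the train_batch_size setting; the 6000 timesteps mentioned above is already a multiple of 40.)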
