Reward function not converging during training

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi!

I am currently working on an automatic train simulator; the goal is to train a model to drive a train according to various constraints (essentially speed limits and a stopping point).
The simulator is pretty simple: the action is the acceleration command; speed and position are then computed and observed by the agent, along with the speed limits of the line. These speed limits consist of a fixed-size list: [[speed_limit_1, distance_1], [speed_limit_2, distance_2], …]; the distances are updated every cycle to represent the distance remaining until each speed limit applies. The speed limit “under” the train has an associated distance of 0; past speed limits are set to [1e4, 1e4] (since a constraining speed limit is essentially a pair of small numbers, I thought that “big numbers” would be understood by the algorithm as not, or less, constraining).
To get a good reward, the agent must respect the speed limits as well as the stopping point, and a bonus is granted inversely proportionally to the
I can provide the code if needed.

The problem is that I get very weird results while training with this environment. PPO seems to produce the “best” results of all the algorithms available in RLlib, but they are not very consistent. Every time, the reward gets quite high, but it eventually drops and keeps dropping until the end of training. Here are 4 runs with the same parameters (only the number of training iterations changes):

Here are my parameters for PPO:

config = {
    'lr': 1e-4,
    'vf_clip_param': 1e8,
}

I am pretty new to RL, so there may be an obvious explanation to this behavior, but after a lot of tests and research, I can’t figure it out.

Thanks in advance if you can help me!

Hi @leo593,

Welcome to the forum! Do you have a full reproduction script you could share?
Which version of ray are you using?

Manny

Hi @mannyv

Sorry for not providing this information.

Here is my environment:

from gym import Env
from gym.spaces import Box, Dict
import numpy as np
import matplotlib.pyplot as plt
import plotext
import math

inf=1e4
TICK = 0.5
GAMMA_TRACTION_MAX = 1.5
GAMMA_FS_MAX = -1.5
VMAX = 38.8
PAEX = 1000.
PAF = 1100.
LIMITE_DOMAINE = PAF + 130
SPEED_LIMIT = np.array([[0., 30.], [250., VMAX], [800., 20.], [PAEX, 0], [PAF, 0], [LIMITE_DOMAINE, 0]], dtype=np.float32)
PENALTY_OVERSPEED = 10
PENALTY_DEPASSEMENT = 1000
BONUS_SPEED = 1000
EPSILON = 25

class SimuMiniEnv_v0(Env):
    MAX_STEPS = 10000

    def __init__(self):

        self.action_space = Box(low=np.array([GAMMA_FS_MAX / 10]), high=np.array([GAMMA_TRACTION_MAX / 10]))
        
        spaces = {
            'acceleration': Box(low=GAMMA_FS_MAX, high=GAMMA_TRACTION_MAX, shape=(1,), dtype=np.float32),
            'speed': Box(low=0, high=VMAX, shape=(1,), dtype=np.float32),
            'position': Box(low=0, high=LIMITE_DOMAINE+EPSILON, shape=(1,), dtype=np.float32),
            'LV': Box(low=-inf, high=inf, shape=(6, 2), dtype=np.float32),
        }
        self.observation_space = Dict(spaces)
        self.state = {
            'acceleration': np.array([0.]),
            'speed': np.array([0.]),
            'position': np.array([0.]),
            'LV': SPEED_LIMIT,
        }
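        self.acceleration = [0.]
        self.speed = [0.]
        self.position = [0.]
        # Elapsed simulated time; step() advances it by TICK and reads it for rewards.
        self.simone_length = 0.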

    def step(self, action):

        bonus = 0
        acc_next_step = self.acceleration[-1] + action[0]

        if acc_next_step < GAMMA_FS_MAX:
            acceleration_temp = GAMMA_FS_MAX  # clamp to the maximum braking deceleration
        elif acc_next_step > GAMMA_TRACTION_MAX:
            acceleration_temp = GAMMA_TRACTION_MAX
        else:
            acceleration_temp = acc_next_step

        speed_temp = acceleration_temp * TICK + self.speed[-1]
        if speed_temp < 0:
            acceleration_temp = -self.speed[-1] / TICK  # deceleration that brings the train exactly to rest
            speed_temp = 0.
        elif speed_temp > VMAX:
            acceleration_temp = (VMAX - self.speed[-1]) / TICK
            speed_temp = VMAX

        self.acceleration.append(acceleration_temp)
        self.speed.append(speed_temp)

        self.position.append(self.speed[-1] * TICK + self.position[-1])

        SL_updated = np.add(SPEED_LIMIT, np.array([-self.position[-1], 0]))

        index_current_SL = [np.where(SL_updated[:, 0] <= 0)[0][-1], 1]

        current_SL = SL_updated[tuple(index_current_SL)]  

        SL_updated[SL_updated < 0] = inf
        SL_updated[index_current_SL[0], 0] = 0

        for i in range(np.size(SL_updated, 0)):
            if SL_updated[i, 0] == inf:
                SL_updated[i, 1] = inf

        done = False
        self.simone_length += TICK  

        if self.speed[-1] >= current_SL:  
            bonus -= PENALTY_OVERSPEED

        if self.position[-1] >= PAEX:
            if self.speed[-1] <= 0.1:
                done = True
                bonus += self.simone_length * TICK * BONUS_SPEED
            else:
                bonus -= PENALTY_DEPASSEMENT / 10

        if self.position[-1] >= PAF:
            if self.speed[-1] <= 0.01:
                done = True
            else:
                bonus -= PENALTY_DEPASSEMENT
                done = True

        if self.simone_length / TICK >= self.MAX_STEPS:
            bonus -= PENALTY_DEPASSEMENT * 10
            done = True

        if self.position[-1] >= LIMITE_DOMAINE:
            done = True

        if self.speed[-1] <= 0.001:
            bonus -= PENALTY_OVERSPEED / 10

        info = {} 

        state = {
                'acceleration': np.array([self.acceleration[-1]]),
                'speed': np.array([self.speed[-1]]),
                'position': np.array([self.position[-1]]),
                'LV': SL_updated,
        }
        self.state = state

        return self.state, bonus, done, info

    def render(self):
        pass

    def reset(self):
        self.state = {
            'acceleration': np.array([0.]),
            'speed': np.array([0.]),
            'position': np.array([0.]),
            'LV': SPEED_LIMIT,
        }
        self.position = [0.]
        self.speed = [0.]
        self.acceleration = [0.]
        self.simone_length = 0.  # restart the episode clock used in step()
        return self.state

And here is how I set up my training:

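from ray import tune
from ray.tune.registry import register_env

# Register the custom environment under a name RLlib can look up.
register_env("SimuMini-v0", lambda config: SimuMiniEnv_v0())
results = tune.run(
    "PPO",
    stop={"training_iteration": 300},
    config={"env": "SimuMini-v0", "lr": 1e-4, "vf_clip_param": 1e8},
)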

I’m using ray 1.12.1 with Python 3.9 (macOS 11 in case that’s useful)

Hi!

Has this problem been solved by others? If not, here are my thoughts on what might hinder learning:
From my understanding, the speed limit the train should obey is at distance 0 and past ones are at distance 1e4, is that correct? What are typical distances for future speed limits?
If these are orders of magnitude higher, I could imagine early rewards to be reached by merely learning to accelerate, but not by effectively mapping the current speed limit + speed to the acceleration because they are almost indistinguishable. This could be resolved by mapping past speed limits to their (negative) distance.
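A minimal sketch of that idea, assuming the same [start_position, speed_limit] layout as SPEED_LIMIT in the environment posted earlier. The helper name encode_speed_limits is hypothetical; it only illustrates keeping a signed distance for past limits instead of overwriting them with 1e4.

```python
import numpy as np

# Same values and layout as SPEED_LIMIT in the environment: [start_position, speed_limit].
SPEED_LIMIT = np.array(
    [[0., 30.], [250., 38.8], [800., 20.], [1000., 0.], [1100., 0.], [1230., 0.]],
    dtype=np.float32,
)


def encode_speed_limits(position):
    """Return [signed_distance, speed_limit] rows relative to the train.

    Hypothetical helper: past limits keep their (negative) distance,
    and the limit currently in force gets distance 0.
    """
    encoded = SPEED_LIMIT.copy()  # never mutate the module-level table
    encoded[:, 0] -= position
    # The active limit is the last one that starts at or behind the train.
    current = np.where(encoded[:, 0] <= 0)[0][-1]
    encoded[current, 0] = 0.0
    return encoded
```

With the 'LV' observation space bounded by -1e4 and 1e4 as in the posted environment, these negative distances stay inside the declared bounds.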

Other than that: Can you please post entropy loss and KL divergence here?
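For reference, a hedged sketch of where those two numbers usually sit in a PPO training result. The nested key layout below (info, learner, default_policy, learner_stats) is an assumption about RLlib 1.x result dicts and may differ between versions; ppo_learner_stats is a hypothetical helper name.

```python
def ppo_learner_stats(result, policy_id="default_policy"):
    """Pull entropy and KL out of one training-iteration result dict.

    Assumes the RLlib 1.x layout result["info"]["learner"][policy_id]["learner_stats"];
    missing entries come back as None rather than raising.
    """
    stats = result["info"]["learner"][policy_id]["learner_stats"]
    return {"entropy": stats.get("entropy"), "kl": stats.get("kl")}
```

The same values also show up as scalars in TensorBoard when pointing it at the Tune results directory.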

Best

Hi @arturn

I’m still trying to solve this problem, thanks for your contribution!

From my understanding, the speed limit the train should obey is at distance 0 and past ones are at distance 1e4, is that correct?

Yes it is! I came up with this idea because a constraining speed limit is a pair of ‘small’ numbers (both speed and distance). The speed limit list is defined in the first lines of my environment; distances range from 0 to 1100, and speed limits from 0 to 40.

If these are orders of magnitude higher, I could imagine early rewards to be reached by merely learning to accelerate, but not by effectively mapping the current speed limit + speed to the acceleration because they are almost indistinguishable. This could be resolved by mapping past speed limits to their (negative) distance.

That’s a good idea! However, when I tried it, I got roughly the same results as before.
Here are 3 runs; the only difference is the number of iterations.



I’m also wondering whether randomizing the speed limits at the beginning of each episode would help the learning process in this case?
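Whether this helps is untested here. A rough sketch of what such a randomization could look like, assuming the same six-row [start_position, speed_limit] layout as SPEED_LIMIT; the function name, the 10 m/s lower bound and the uniform sampling are all hypothetical choices.

```python
import numpy as np


def sample_speed_limits(rng, vmax=38.8, paex=1000.0, paf=1100.0, limite=1230.0):
    """Draw a random speed-limit profile with the same shape as SPEED_LIMIT.

    Three running segments get random start positions and limits; the three
    stopping rows (PAEX, PAF, LIMITE_DOMAINE) stay fixed with a limit of 0.
    """
    # The first segment always starts at position 0; the other two start
    # somewhere before the stopping point, in increasing order.
    starts = np.concatenate(([0.0], np.sort(rng.uniform(0.0, paex, size=2))))
    limits = rng.uniform(10.0, vmax, size=3)
    running = np.column_stack((starts, limits))
    stopping = np.array([[paex, 0.0], [paf, 0.0], [limite, 0.0]])
    return np.vstack((running, stopping)).astype(np.float32)
```

The environment's reset() could then call this with a generator such as np.random.default_rng() and store the result in place of the constant table.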

I’d like to second @mannyv's request for a reproduction script.

Everything is in my 2nd post: my custom gym env & the code I use for training. I don’t know how to aggregate it into a single script.

@leo593,

For my setup, which was showing similar behavior, I got it to stop “crashing” by setting grad_clip in the config. It did slow down learning though.
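As a sketch, this is the config from the tune.run call earlier in the thread with grad_clip added. The value 40.0 is an arbitrary placeholder for illustration, not a recommendation; a suitable value would have to be found by trial.

```python
# Same PPO settings as in the original post, plus gradient clipping.
# 40.0 is a placeholder value chosen only for this example.
config = {
    "env": "SimuMini-v0",
    "lr": 1e-4,
    "vf_clip_param": 1e8,
    "grad_clip": 40.0,
}
```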

Interesting! How did you figure out which value to set ‘grad_clip’ to? I didn’t find much information about it.

@leo593 I guessed :laughing:

@mannyv Actually that’s weird: when I set grad_clip (to any value other than None), I get an error (NotImplementedError). But maybe I should open a new topic for this issue.

Hi @leo593 ,
Please provide a reproduction script that produces that error!