How severely does this issue affect your experience of using Ray?
- Medium: It makes it significantly harder to complete my task, but I can work around it.
Hi!
I am currently working on an automatic train simulator; the goal is to train a model to drive a train according to various constraints (essentially speed limits and a stopping point).
The simulator is pretty simple: the action is the acceleration command, then speed and position are computed and observed by the agent, along with the speed limits of the line. These speed limits consist of a fixed-size list: [[speed_limit_1, distance_1], [speed_limit_2, distance_2], …]; the distances are updated every cycle to represent the remaining distance until each speed limit applies. The speed limit "under" the train has an associated distance of 0; past speed limits are set to [1e4, 1e4] (since a constraining speed limit is essentially a pair of small numbers, I figured that "big numbers" would be understood by the algorithm as not, or less, constraining).
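To make that layout concrete, here is a rough sketch of how such an observation can be built each cycle (the names, the fixed size of 5 entries, and the helper function are simplifications for illustration, not my exact code):

import numpy as np

N_LIMITS = 5             # fixed number of [speed_limit, distance] pairs in the observation
PAST_LIMIT = [1e4, 1e4]  # filler for speed limits already behind the train

def speed_limit_observation(train_position, limits):
    # `limits` is a list of (start_position, speed_limit) pairs, sorted by
    # start_position, with the first section starting at position 0.
    # Index of the limit currently "under" the train:
    current = max(i for i, (start, _) in enumerate(limits) if start <= train_position)
    obs = []
    for i, (start, speed) in enumerate(limits):
        if i < current:
            obs.append(list(PAST_LIMIT))                   # already passed
        elif i == current:
            obs.append([speed, 0.0])                       # limit under the train
        else:
            obs.append([speed, start - train_position])    # upcoming limit
    # Pad/trim to the fixed size expected by the observation space.
    obs = (obs + [list(PAST_LIMIT)] * N_LIMITS)[:N_LIMITS]
    return np.asarray(obs, dtype=np.float32)

# e.g. train at 350 m on a line with three speed-limit sections:
print(speed_limit_observation(350.0, [(0.0, 30.0), (200.0, 20.0), (500.0, 40.0)]))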
To get a good reward, the agent must respect the speed limits as well as the stopping point, and a bonus is granted inversely proportional to the
I can provide the code if needed.
The problem is, I get very weird results when training with this environment. PPO seems to produce the "best" results across all the algorithms available in RLlib, but they are not very consistent. Every time, the reward gets quite high but then eventually drops until the end of the training. Here are 4 runs with the same parameters (only the number of training iterations changes):
Here are my parameters for PPO:
config = {
    'lr': 1e-4,
    'vf_clip_param': 1e8,
}
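In case it helps, this is roughly how I launch the training (I use the built-in CartPole-v1 below only as a stand-in for my custom environment, which I register with tune.register_env; the stopping iteration is the only thing I change between runs):

import ray
from ray import tune

ray.init()

tune.run(
    "PPO",
    config={
        "env": "CartPole-v1",  # stand-in; in my case it is the custom train-sim env
        "lr": 1e-4,
        "vf_clip_param": 1e8,
    },
    stop={"training_iteration": 100},  # the value I vary between runs
)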
I am pretty new to RL, so there may be an obvious explanation for this behavior, but after a lot of tests and research, I can't figure it out.
Thanks in advance if you can help me!