Hi everyone,
I’m encountering some issues while training my agent on a custom environment. Specifically, the value of result["episode_reward_mean"] remains constant for several iterations during training. This behavior is significantly affecting the agent’s performance.
After searching through the forums, I found that this might be related to how Ray RLlib handles episode termination. I tried introducing the horizon parameter in my configuration dictionary, but the situation hasn’t improved.
I have some questions about how the horizon parameter works:
For instance, if horizon is set to 1000 and my episode ends at step 900, I assume everything should work fine. But what happens if my environment doesn’t terminate even at step 1000? My episodes have variable lengths, and I can’t predict their exact duration in advance.
Why, in your opinion, am I still facing this issue despite using the horizon parameter?
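To make the question concrete, my environment behaves roughly like this toy stand-in (simplified, not my real code; the class name, spaces, and length range are just placeholders to illustrate the variable episode length):

import gym
import numpy as np
from gym.spaces import Box, Discrete

class ToyVariableLengthEnv(gym.Env):
    # Toy stand-in for my env: each episode ends after a random number of steps,
    # sometimes before the horizon of 1000 and sometimes after it.
    def __init__(self, env_config=None):
        self.observation_space = Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = Discrete(2)
        self.t = 0
        self.episode_len = 0

    def reset(self):
        self.t = 0
        # The true episode length is only known as the episode unfolds;
        # here it is drawn at random to mimic that.
        self.episode_len = np.random.randint(500, 1500)
        return self.observation_space.sample()

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len  # may never trigger before step 1000
        return self.observation_space.sample(), 1.0, done, {}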
Here’s my configuration dictionary:
config = {
    "env": Custom_Env,
    "env_config": env_config,
    "exploration_config": {
        "type": "StochasticSampling"
    },
    "model": {
        "custom_model": CustomNet,
    },
    "lr": 0.95,
    "gamma": 0.001,
    "num_workers": 12,
    "num_envs_per_worker": 8,
    "num_gpus": 1,
    "horizon": 1000
}
I’m using the PG algorithm.
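For completeness, I launch training roughly like this (simplified; I’m assuming here the Ray 1.x-style ray.rllib.agents API, which matches the config keys above):

import ray
from ray.rllib.agents.pg import PGTrainer

ray.init()
trainer = PGTrainer(config=config)  # "env" and "env_config" are picked up from the config dict
for i in range(200):
    result = trainer.train()
    print(i, result["episode_reward_mean"], result["episode_len_mean"])

The episode_len_mean output should show the average episode length alongside the reward.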
Below, I’ve attached a plot of result["episode_reward_mean"] over the training period. As you can see, there are noticeable plateaus in the graph.
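One thing I still want to verify is whether episodes are actually being cut off at the horizon. A callback along these lines (again assuming the Ray 1.x callbacks API; the metric name is arbitrary) should log the true episode lengths as a custom metric:

from ray.rllib.agents.callbacks import DefaultCallbacks

class EpisodeLengthCallback(DefaultCallbacks):
    def on_episode_end(self, *, worker, base_env, policies, episode, env_index, **kwargs):
        # Record how long each episode really was; if this sits at exactly 1000,
        # episodes are being truncated at the horizon rather than terminating on their own.
        episode.custom_metrics["true_episode_len"] = episode.length

It would be registered with config["callbacks"] = EpisodeLengthCallback, and the values should then show up under result["custom_metrics"].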
Any insights or suggestions would be greatly appreciated!
Regards,
L.E.O.