Issue with custom environment

Hi, I’m using a custom environment. I was training with stable-baselines3, but I’m trying to migrate to RLlib for scaling purposes. However, after several training iterations and a lot of hyperparameter tuning, my model never converged to a result nearly as good as the one from stable-baselines3, so I ran an episode with a trained checkpoint to see what is going on. My agent is always outputting the extreme actions, like:

array([-1., 1., 1., 1., -1.], dtype=float32)

(Actions are normalized between -1 and 1.)

I don’t think the environment itself has a bug, since it trains fine with stable-baselines3, and I don’t see an issue with my RLlib config either, so I can’t figure out why my agent gets stuck at these extreme actions.
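
For reference, this is roughly how I’m rolling out the trained checkpoint (a minimal sketch, not my actual code: `MyCustomEnv` is a stand-in for my environment, the checkpoint path is a placeholder, and I’m assuming the older `agents`-style PPOTrainer API):

```python
import numpy as np
import gym
import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.registry import register_env


class MyCustomEnv(gym.Env):
    """Stand-in for my real environment (same 5-dim action space, normalized to [-1, 1])."""

    def __init__(self, env_config=None):
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(5,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(10,), dtype=np.float32)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        return self.observation_space.sample(), 0.0, True, {}


ray.init()
register_env("my_custom_env", lambda cfg: MyCustomEnv(cfg))

trainer = PPOTrainer(config={"env": "my_custom_env", "framework": "torch"})
trainer.restore("/path/to/checkpoint/checkpoint-100")  # placeholder path

env = MyCustomEnv()
obs = env.reset()
done = False
while not done:
    # Deterministic action from the trained policy (no exploration noise).
    action = trainer.compute_single_action(obs, explore=False)
    print(action)  # keeps printing values like [-1., 1., 1., 1., -1.]
    obs, reward, done, info = env.step(action)
```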

Hey @PatrickSampaioUSP, thanks for posting this issue. Would you be able to share your environment and config so we can take a look? Without those, it’s impossible to figure out why RLlib isn’t learning. Thanks.

I have a reproduction in the thread “Issues reproducing stable-baselines3 PPO performance with rllib”.