SVL has recently launched BEHAVIOR, a new challenge for embodied, multi-task learning in home environments. As part of this, we are recommending that users start with Ray or stable-baselines3 to get spun up quickly and to support scalable, multi-environment training.
We shipped a Ray example, but I’ve had trouble replicating the PPO performance on a point navigation task in our environment. I went through and tried to match all of the settings and the model architecture from stable-baselines3, but I’ve still been unable to reproduce the stable-baselines3 results in Ray. I’m hoping I’m doing something obviously wrong.
Here is the example repo (I’ve dockerized everything to make the results as reproducible as possible):
The one snag is that we have to distribute the models encrypted and under a license agreement. The instructions are in the README in that repo, and it shouldn’t take more than a couple of minutes for you to get approved. Please let me know if you have any questions or if anything doesn’t work with the example. Note that for Ray you may have to lower or raise the CPUs allocated to your train workers; a rough sketch of the relevant config keys is below.
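For reference, this is roughly what I mean by adjusting the allocation (just a sketch with placeholder values, not the config from the repo; the relevant keys are `num_workers` and `num_cpus_per_worker`):

```python
# Sketch only: adjust rollout-worker CPU allocation in the RLlib config (Ray 1.x API).
# The env ID and resource numbers below are placeholders, not the repo's values.
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    config={
        "env": "Pendulum-v0",        # placeholder; use the env from the example repo
        "num_workers": 4,            # rollout workers; lower this if CPUs are oversubscribed
        "num_cpus_per_worker": 1,    # CPUs reserved per rollout worker
        "num_gpus": 1,               # GPUs for the learner process (0 if none)
    },
    stop={"timesteps_total": 1_000_000},
)
```

If the requested resources exceed what your machine actually has, Tune will just sit there waiting for them, which can look like a hang.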
Hey @stefanbschneider, I remember you have some experience working with stable-baselines. Do you know of any immediate gotchas when doing this comparison?
Hi, I can’t think of anything in particular to look out for. Also, I’ve only worked with stable-baselines 2 so far, not with SB3; I’m not sure how big the difference is.
When I switched to RLlib, I kept RLlib’s PPO defaults and they worked quite well for me. I think they were somewhat different from the SB2 default hyperparameters for PPO.
Did you try running PPO on RLlib with the RLlib default params?
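Just to be concrete, by “default params” I mean something like the sketch below (the env ID is a placeholder): only set the env and let RLlib fill in everything else.

```python
# Sketch (Ray 1.x): PPO with the untouched RLlib default config, only the env is set.
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
trainer = PPOTrainer(config={"env": "Pendulum-v0"})  # placeholder env; use your point-nav env
for i in range(100):
    result = trainer.train()
    print(i, result["episode_reward_mean"])
```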
I also noticed that I needed some more training steps to converge with RLlib in my example (but less training time due to parallelization).
Here, I don’t think that’s the issue since there’s no learning/convergence at all after 1M+ train steps…
Are you using the same environment and reward etc. in both cases? I think stable-baselines has some built-in filters/normalizers for observations that could make a difference (see the sketch at the end of this post).
Hm, sorry, I’m just guessing here.
You could also have a look at the stable_baselines2 to RLlib example here:
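To make the normalization point concrete (a sketch only, since I’m guessing): in SB3, observation/reward normalization typically comes from wrapping the env in `VecNormalize`, while the closest RLlib equivalent I know of is the `MeanStdFilter` observation filter, which is off by default:

```python
# Sketch: SB3 runs often normalize observations (and rewards) via VecNormalize;
# RLlib does not do this unless you enable an observation filter.
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# stable-baselines3 side ("Pendulum-v0" is a placeholder env)
venv = DummyVecEnv([lambda: gym.make("Pendulum-v0")])
venv = VecNormalize(venv, norm_obs=True, norm_reward=True)
model = PPO("MlpPolicy", venv, verbose=1)

# RLlib side: the rough equivalent for observations is the running mean/std filter
rllib_config = {
    "env": "Pendulum-v0",                   # placeholder
    "observation_filter": "MeanStdFilter",  # RLlib's default is "NoFilter"
}
```

As far as I know there is no direct RLlib switch for VecNormalize’s reward normalization, so if the SB3 run uses it, that alone could matter.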
Thanks for the response! I did try RLlib’s PPO with the default settings (as the first thing, before I tried matching the models as in the example repo). With the default options it does perform better, but the episode reward mean improves only negligibly (it converges to around -2), and the average steps per episode converges to 480.
Only after I noticed that I was not matching SB3’s performance did I comb through the stable-baselines implementation (including the RLlib example you linked) and try to match all of Ray’s hyperparameters and model architecture to the stable-baselines3 models (SB3 is written in PyTorch and is community driven). A sketch of the kind of mapping I mean is at the end of this post.
I’m using the exact same environment and reward; you can see the example in the repository I linked. The only difference is Ray (plus any settings, filters, or model differences I failed to match, although I tried to make these as aligned as possible).
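Here is the kind of mapping I mean by matching the architecture (sketch only; the layer sizes and activation below are illustrative placeholders, the actual values are in the repo’s scripts):

```python
# Sketch of where each framework takes its network architecture.
# Layer sizes/activation are illustrative placeholders, not the repo's real values.
import torch.nn as nn

# stable-baselines3 side: separate pi/vf MLPs via policy_kwargs
sb3_policy_kwargs = dict(
    net_arch=[dict(pi=[256, 256], vf=[256, 256])],
    activation_fn=nn.Tanh,
)
# model = PPO("MlpPolicy", env, policy_kwargs=sb3_policy_kwargs)

# RLlib side: the "model" sub-config should mirror the same layout
rllib_model_config = {
    "fcnet_hiddens": [256, 256],
    "fcnet_activation": "tanh",
    "vf_share_layers": False,  # SB3's default actor-critic policy keeps pi/vf separate
}
# trainer_config = {"env": ..., "model": rllib_model_config, ...}
```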
One thing that comes to mind is our recent change to always learn in normalized action spaces, which was only introduced in 1.4 and makes learning in continuous action spaces much more stable.
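If it helps, this is roughly how that surfaces in the config (sketch; the flag below is a top-level config key, though its default may differ between versions, and the env ID is a placeholder):

```python
# Sketch: have RLlib learn in a normalized ([-1, 1]) action space and
# rescale actions back to the env's real bounds before stepping the env.
config = {
    "env": "Pendulum-v0",       # placeholder continuous-action env
    "normalize_actions": True,  # learn normalized, unsquash before sending to the env
}
```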
I’m on Ray 1.4.0 (I see 1.5 is out now). I’m definitely happy to dig into this a bit if you all have any pointers. I’ve already gone through the implementations and nothing looks substantially different at face value, but I haven’t checked the observation normalization/filters in detail.
Sorry, correction: The action normalization improvement was introduced in 1.5(!) (not 1.4).
Would you be able to give it one more shot with the latest 1.5 version?
Sorry for the delay. I added two additional scripts, ray_defaults and ray_defaults_deeper (the latter exactly matches the model used in stable-baselines3, but seemed to do a bit worse), and re-trained with Ray 1.5.
ray_defaults_deeper: (see next post, I still have a 1 image per post limit)
The latter is still training; I’ll also try the deeper model with the hyperparameters I scraped from stable-baselines. It’s possible the latter model will continue to improve, but it’s lagging behind stable-baselines by a good amount so far.
edit: it looks like it has not really improved or converged; I trained the above models up to ~7 million steps and there was no performance improvement.
I also tried training the exact matched-architecture model with the hyperparameters matched to Ray; it performed approximately the same as the above. I’m not quite sure what else to try tweaking.
I am using RLlib PPO for my thesis, so I need to be able to trust it completely. Is it possible that PPO has an implementation bug in the value function calculations?