Issues reproducing stable-baselines3 PPO performance with rllib

mjlbach · July 27, 2021, 12:01am

Hi all,

SVL has recently launched BEHAVIOR, a new challenge for embodied, multi-task learning in home environments. As part of this, we are recommending that users start with ray or stable-baselines3 to get spun up quickly and to support scalable, multi-environment training.

We shipped a ray example, but I’ve had trouble replicating the PPO performance on a point navigation task in our environment. I went through and tried to match all settings and the model architecture from stable-baselines3, but I’ve been unable to replicate the results of stable-baselines3 in Ray. I was hoping I was doing something obviously wrong.

Here is the example repo (GitHub - mjlbach/iGibson-ray-repro). I’ve dockerized everything to make the results as reproducible as possible.

The one snag is that we have to distribute the models encrypted and under a license agreement. The instructions are in the readme in that repo, and it shouldn’t take more than a couple of minutes for you to get approved. Please let me know if you have any questions, or if anything doesn’t work with the example. Note that for ray, you may have to lower or raise the CPU allocated to your train workers.

mjlbach · July 27, 2021, 12:08am

Due to discourse limitations as a new account, I have to break this out into three separate posts to avoid the link limit:

ray PPO:

mjlbach · July 27, 2021, 12:08am

Stable-baselines3 PPO:

rliaw · July 27, 2021, 12:32am

Hey @stefanbschneider , I remember you had some experience working with stable-baselines. Do you know of any sort of immediate gotchas when doing this comparison?

stefanbschneider · July 27, 2021, 6:59am

Hi, I can’t think of anything in particular to look out for. Also, I only worked with stable-baselines2 not with sb3 so far; not sure how big the difference is.

When I switched to RLlib, I kept the PPO defaults of RLlib and they worked quite well for me. I think they were somewhat different than the sb2 default hyperparameters for PPO.
Did you try running PPO on RLlib with the RLlib default params?

I also noticed that I needed some more training steps to converge with RLlib in my example (but less training time due to parallelization).
Here, I don’t think that’s the issue since there’s no learning/convergence at all after 1M+ train steps…

Are you using the same environment and reward etc in both cases? I think stable_baselines has some built-in filters/normalizers for observations that could make a difference.
Hm, sorry, I’m just guessing here.
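To make the observation-normalization point concrete, here is a minimal sketch of the kind of running mean/std filter in question. SB3 offers a `VecNormalize` wrapper and RLlib an `observation_filter` option; whether either is active depends on the training script, so this is worth checking on both sides. The `RunningMeanStd` class below is a hypothetical stand-in written for illustration, not code from either library.

```python
import math


class RunningMeanStd:
    """Running mean/variance of scalar observations (Welford's algorithm).

    Hypothetical illustration only; not the SB3 or RLlib implementation.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        # Population variance; fall back to 1.0 until two samples are seen.
        return self.m2 / self.count if self.count > 1 else 1.0

    def normalize(self, x, eps=1e-8):
        # Shift and scale an observation using the statistics seen so far.
        return (x - self.mean) / math.sqrt(self.var + eps)


rms = RunningMeanStd()
for obs in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    rms.update(obs)

print(rms.mean, rms.var, rms.normalize(9.0))
```

If one framework feeds the policy normalized observations and the other feeds raw ones, the two runs are effectively solving differently scaled problems even with identical hyperparameters.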

You could also have a look at the stable_baselines2 to RLlib example here:

SB2: ray/sb2rllib_sb_example.py at master · ray-project/ray · GitHub
Equivalent RLlib code: ray/sb2rllib_rllib_example.py at master · ray-project/ray · GitHub

Maybe there’s something there that helps.

mjlbach · July 27, 2021, 4:41pm

Hi Stefan,

Thanks for the response! I did try the default PPO implementation in ray (as the first thing, before I tried matching the models as in the example repo). With the default options, it does perform better but the episode reward mean only negligibly improves (converges to around -2), and the average steps per episode converges to 480.

Only after I noticed that I was not matching the performance of SB3 did I comb through the stable-baselines implementation (including the ray example you linked) and try to match all hyperparameters and the model architecture in ray to the stable-baselines3 models (sb3 is written in pytorch, and is community driven).

I’m using the exact same environment and reward, you can see the example in the repository I linked. The only difference is ray (and possibly any settings, filters, or model differences I failed to match, although I tried to make these as aligned as possible).
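For readers attempting the same matching, the sketch below shows one way to translate SB3 PPO constructor defaults into an RLlib-style config dict. The SB3 values are the documented `PPO` defaults; the RLlib key names are recalled from the RLlib 1.x PPO config and should be verified against the installed version. The `sb3_to_rllib` helper is hypothetical and is not taken from the linked repository.

```python
# SB3 PPO constructor defaults.
SB3_PPO_DEFAULTS = {
    "learning_rate": 3e-4,
    "n_steps": 2048,
    "batch_size": 64,
    "n_epochs": 10,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "clip_range": 0.2,
    "ent_coef": 0.0,
    "vf_coef": 0.5,
    "max_grad_norm": 0.5,
}


def sb3_to_rllib(sb3, n_envs=1):
    """Map SB3 PPO kwargs to assumed RLlib 1.x PPO config keys (hypothetical helper)."""
    return {
        "lr": sb3["learning_rate"],
        # SB3 collects n_steps per environment before each update.
        "train_batch_size": sb3["n_steps"] * n_envs,
        "sgd_minibatch_size": sb3["batch_size"],
        "num_sgd_iter": sb3["n_epochs"],
        "gamma": sb3["gamma"],
        "lambda": sb3["gae_lambda"],
        "clip_param": sb3["clip_range"],
        "entropy_coeff": sb3["ent_coef"],
        "vf_loss_coeff": sb3["vf_coef"],
        "grad_clip": sb3["max_grad_norm"],
    }


config = sb3_to_rllib(SB3_PPO_DEFAULTS, n_envs=8)
for key, value in config.items():
    print(f"{key}: {value}")
```

Settings with no direct SB3 counterpart, such as RLlib's KL penalty coefficient and its value-function clip parameter, are not covered by this mapping and need to be checked separately.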

Best,
Michael

sven1977 · July 27, 2021, 5:25pm

Hey @mjlbach , thanks for raising this issue. And thanks @stefanbschneider for your responses!

@mjlbach , what ray version are you on?

One thing that comes to mind is our recent change to always learn in normalized action spaces, which was only introduced in 1.4 and makes learning with continuous actions much more stable.
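As a rough illustration of what learning in a normalized action space means, the sketch below maps between a policy output in [-1, 1] and an environment's bounded action range. The function names and the clipping choice are assumptions made for this example; this is not RLlib's code.

```python
def unsquash_action(a_norm, low, high):
    """Map a normalized action in [-1, 1] to the environment range [low, high]."""
    a_norm = max(-1.0, min(1.0, a_norm))  # clip out-of-range policy outputs
    return low + (a_norm + 1.0) * 0.5 * (high - low)


def squash_action(a_env, low, high):
    """Inverse mapping: environment range [low, high] back to [-1, 1]."""
    return 2.0 * (a_env - low) / (high - low) - 1.0


# Example with a hypothetical action bound of [-2, 6].
mid = unsquash_action(0.0, -2.0, 6.0)
top = unsquash_action(1.0, -2.0, 6.0)
back = squash_action(mid, -2.0, 6.0)
print(mid, top, back)
```

With this mapping the policy always sees the same output scale regardless of the environment's bounds, which is the property that tends to help with continuous control.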

mjlbach · July 27, 2021, 5:41pm

Hi @sven1977,

I’m on ray 1.4.0 (I see 1.5 is out now). I’m definitely happy to dig into this a bit if you all have any pointers. I’ve already gone through the implementations and nothing looks substantially different at face value, but I haven’t checked observation normalization/filters in detail.

Best,
Michael

sven1977 · July 27, 2021, 5:59pm

Sorry, correction: The action normalization improvement was introduced in 1.5(!) (not 1.4).
Would you be able to give it one more shot with the latest 1.5 version?

mjlbach · July 29, 2021, 6:04am

Sorry for the delay. I added two additional scripts, ray_defaults, and ray_defaults_deeper (which exactly matches the model used in stable-baselines3, but seemed to do a bit worse) and re-trained with ray 1.5.

ray_defaults:

ray_defaults_deeper: (see next post, I still have a 1 image per post limit)

The latter is still training. Next I’ll try the deeper model with the hyperparameters I scraped from stable-baselines. It’s possible the latter model will continue to improve, but it’s lagging behind stable-baselines by a good amount so far.

mjlbach · July 29, 2021, 6:08am

ray_defaults_deeper:

edit: looks like it has not really improved or converged. I trained the above models up to ~7 million steps and there was no performance improvement.

I also tried training the exact model with the matched architecture with the hyperparameters matched to ray, it performed approximately the same as the above. I’m not quite sure what else to try tweaking.

rliaw · August 9, 2021, 5:27pm

BTW, @mjlbach sorry for the lagging response here. Is there a repro script that we can run on our side?

mjlbach · August 10, 2021, 1:01pm

Yep! In the top post, repasting the link here: GitHub - mjlbach/iGibson-ray-repro

MatiasCova · October 12, 2021, 1:20pm

Maybe this is related?

github.com/ray-project/ray

[Bug] PPO value function loss is incorrect

opened 03:54PM - 11 Oct 21 UTC

Acciorocketships

bug triage

### Search before asking

- [X] I searched the [issues](https://github.com/ray-p…roject/ray/issues) and found no similar issues.

### Ray Component

RLlib

### What happened + What you expected to happen

I think the calculation of the clipped value function loss in PPO is incorrect.

In the current implementation, `vf_loss1` is the error between the current value function output and the target. `vf_loss_clipped` is the difference between the previous value function and the current value function, clamped with a config clip parameter. `vf_loss2` is the error between `vf_loss_clipped` and the target. Then, the total value loss is computed as the mean of the MAX of `vf_loss1` and `vf_loss2`.

This doesn't make sense, because the clipping parameter really should be _clipping the value of the loss_. However, in the current implementation, the loss can be much bigger than the clipping parameter. For example, even if the clipping parameter is set to 0, then the loss is still either the difference between the current or the last value function output and the target (whichever one is larger).

Let me provide a small example to demonstrate why this implementation doesn't make sense. let's say that the target value is 1.0, and the last value function output was 0.0, but since the last iteration the model improved a bit so the value function output is currently 0.8. Let us also assume that our clipping parameter is 0.1. Then, we would calculate

`vf_loss1 = (1.0 - 0.8)**2 = 0.04`,
`vf_loss_clipped = 0.0 + clamp(0.8 - 0.0, 0.1, -0.1) = 0.1`,
`vf_loss2 = (1.0 - 0.1)**2 = 0.81`,
`vf_loss = max(0.04, 0.81) = 0.81`.

So, our loss is 0.81 even though we are very close to our target. This will cause us to overshoot. As you can see, the clip parameter did nothing to actually clamp the value function loss, and the max operation caused us to choose a loss that was far too large.

In contrast, I would expect the vf loss to be calculated by simply clamping `vf_loss1`:

`vf_loss = clamp( (value_target - curr_value_function_out)**2, clip_param, -clip_param)`

cc @smorad

### Reproduction script

from `ppo_torch_policy.py:88`:

```python
prev_value_fn_out = train_batch[SampleBatch.VF_PREDS]
value_fn_out = model.value_function()
vf_loss1 = torch.pow(
    value_fn_out - train_batch[Postprocessing.VALUE_TARGETS], 2.0)
vf_clipped = prev_value_fn_out + torch.clamp(
    value_fn_out - prev_value_fn_out, -policy.config["vf_clip_param"],
    policy.config["vf_clip_param"])
vf_loss2 = torch.pow(
    vf_clipped - train_batch[Postprocessing.VALUE_TARGETS], 2.0)
vf_loss = torch.max(vf_loss1, vf_loss2)
mean_vf_loss = reduce_mean_valid(vf_loss)
```

### Anything else

proposed updated code:

```python
prev_value_fn_out = train_batch[SampleBatch.VF_PREDS]
value_fn_out = model.value_function()
vf_loss1 = torch.pow(
    value_fn_out - train_batch[Postprocessing.VALUE_TARGETS], 2.0)
vf_loss = torch.clamp(
    vf_loss1, -policy.config["vf_clip_param"],
    policy.config["vf_clip_param"])
mean_vf_loss = reduce_mean_valid(vf_loss)
```

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!
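The worked numbers in the quoted issue can be reproduced with a few lines of plain Python. This is only a sketch of the arithmetic for a single scalar sample: the helper names are hypothetical, plain floats stand in for the torch tensors, and it takes no position on which formulation is the right one.

```python
def clamp(x, lo, hi):
    """Scalar stand-in for torch.clamp."""
    return max(lo, min(hi, x))


def max_of_clipped_vf_loss(value, prev_value, target, clip):
    """Value loss as described for the implementation quoted in the issue."""
    vf_loss1 = (value - target) ** 2
    vf_clipped = prev_value + clamp(value - prev_value, -clip, clip)
    vf_loss2 = (vf_clipped - target) ** 2
    return vf_loss1, vf_clipped, vf_loss2, max(vf_loss1, vf_loss2)


def clamped_vf_loss(value, target, clip):
    """Variant proposed in the issue: clamp the squared error itself."""
    return clamp((value - target) ** 2, -clip, clip)


# Numbers from the issue: target 1.0, previous output 0.0, current output 0.8.
loss1, clipped, loss2, loss = max_of_clipped_vf_loss(
    value=0.8, prev_value=0.0, target=1.0, clip=0.1)
proposed = clamped_vf_loss(value=0.8, target=1.0, clip=0.1)

print(f"vf_loss1={loss1:.2f} vf_clipped={clipped:.2f} "
      f"vf_loss2={loss2:.2f} vf_loss={loss:.2f} proposed={proposed:.2f}")
```

Running it with other values of `clip` shows how strongly the max-based loss depends on the distance between the previous prediction and the target.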

I am using RLlib PPO for my thesis, so I need to be able to trust it completely. Is it plausible that PPO has an implementation bug in the value function loss calculation?

mjlbach · March 16, 2022, 4:17pm

After this PR my issue was resolved!
