RLlib benchmarking

I’m searching for RLlib benchmark results.
(I’d simply like to validate, e.g., PPO on MuJoCo’s Walker2d, where I see significantly worse RLlib performance compared to e.g. SB3.)

It seems likely that benchmark results are not directly included in example scripts like benchmark_ppo_mujoco.py; these scripts provide configurations and stopping criteria for PPO on MuJoCo environments (including Walker2d), but do not contain actual benchmark numbers or logs. The rl-experiments repo is referenced as a place for benchmark results, but it may not be up to date or cover every environment/algorithm combination you are interested in.

My understanding is that for the most current and detailed RLlib benchmark results, you should check both the rl-experiments repo and the tuned_examples directory, as suggested in the RLlib developer documentation. These are the locations where the community is encouraged to contribute and update benchmark results, but there is no guarantee of completeness or recency for every environment. Would you like more detail on how to run your own benchmarks or interpret the provided example scripts?
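If you want to run your own comparison rather than rely on published numbers, a minimal config sketch using RLlib's `PPOConfig` builder API might look like the following. The hyperparameters, environment version, and stopping criterion here are illustrative assumptions, not values taken from benchmark_ppo_mujoco.py — consult the tuned_examples directory for the actual tuned settings:

```python
# Minimal sketch: PPO on Walker2d with RLlib's config API.
# Assumes ray[rllib] and gymnasium[mujoco] are installed.
# Hyperparameters below are placeholders, NOT the tuned benchmark values.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("Walker2d-v4")   # environment version is an assumption
    .training(
        lr=3e-4,                  # placeholder; see tuned_examples for real values
        train_batch_size=4000,    # placeholder
        gamma=0.99,               # placeholder
    )
)

algo = config.build()
for _ in range(100):              # stop criterion is illustrative only
    result = algo.train()
    # The exact result key differs between RLlib's old and new API stacks;
    # inspect `result` to find the episode-return metric on your version.
    print(result)
```

Comparing the resulting learning curve against an SB3 PPO run with matched hyperparameters (learning rate, batch size, number of epochs, GAE settings) is usually necessary before concluding one library underperforms the other.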
