PPO not able to learn from an env as simple as Pendulum-v1?

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

I've noticed it is hard to find a full runnable example in the Ray documentation of training on an environment with a continuous action space. I tried running PPO on the simplest such env, Pendulum-v1, but the model doesn't seem to be learning. Is this behavior expected?

My test code, running with Ray==2.1.0 and Python==3.8 on SageMaker:

import ray
from ray import air, tune
# Missing from the original snippet; in Ray 2.1 the MLflow callback lives here:
from ray.air.callbacks.mlflow import MLflowLoggerCallback

tuner = tune.Tuner(
    'PPO',
    tune_config=tune.TuneConfig(
        metric="episode_reward_mean",
        mode="max",
        num_samples=1,
        # max_concurrent_trials=1,
        time_budget_s=60 * 60 * 8,
    ),
    run_config=air.RunConfig(
        # stop={"time_total_s": 60 * 60 * 8},
        checkpoint_config=air.CheckpointConfig(
            checkpoint_at_end=True,
            checkpoint_frequency=200,
            num_to_keep=5,
        ),
        local_dir='/home/ec2-user/SageMaker/ray_results2',
        callbacks=[MLflowLoggerCallback(tracking_uri='http://localhost:5000', experiment_name="dev")],
    ),
    param_space={
        "env": "Pendulum-v1",
        # "clip_actions": True,
        "num_workers": 4,
        "num_gpus": 0,  # number of GPUs to use
        "framework": 'torch',  # tf|tf2|torch
    },
)

tuner.fit()

I also tried Stable Baselines3, which seems to work fine:

from stable_baselines3 import PPO
import gym

model = PPO(
    # 'MultiInputPolicy',
    'MlpPolicy',
    gym.make('Pendulum-v1'),
    gamma=0.98,
    # gSDE settings from https://proceedings.mlr.press/v164/raffin22a.html
    use_sde=True,
    sde_sample_freq=4,
    learning_rate=1e-3,
    verbose=1,
)

# Train the agent
model.learn(total_timesteps=int(1e6))

It achieves a mean_reward of about -200 after running for a few minutes.

Hi,

RLlib does not automatically tune hyperparameters, but hyperparameter tuning is a critical part of reinforcement learning.
If you use Ray Tune to tune some of the more important PPO hyperparameters, you will get good results.
Have a look at our tuned examples if you are looking for such configurations.
For example, here is our tuned example for Pendulum-v1 and PPO.
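
To make that concrete, here is a minimal sketch of what tuning a few of the more impactful PPO hyperparameters with Ray Tune could look like on Pendulum-v1, using the same Ray 2.1 Tuner API as in the question. The search ranges and stopping values below are illustrative assumptions, not the official tuned configuration (see rllib/tuned_examples in the Ray repository for that):

import ray
from ray import air, tune

tuner = tune.Tuner(
    'PPO',
    tune_config=tune.TuneConfig(
        metric="episode_reward_mean",
        mode="max",
        num_samples=8,  # try several hyperparameter combinations
    ),
    run_config=air.RunConfig(
        # Stop each trial once it solves the task or exhausts its budget.
        # These thresholds are illustrative, not from the tuned example.
        stop={"episode_reward_mean": -250, "timesteps_total": 400_000},
    ),
    param_space={
        "env": "Pendulum-v1",
        "framework": "torch",
        "num_workers": 4,
        # Illustrative search space over some of the PPO hyperparameters
        # that tend to matter most on this task:
        "gamma": tune.choice([0.95, 0.99]),
        "lr": tune.loguniform(1e-5, 1e-3),
        "lambda": tune.choice([0.1, 0.95]),
        "train_batch_size": 512,
        "sgd_minibatch_size": 64,
        "num_sgd_iter": 6,
    },
)

results = tuner.fit()
print(results.get_best_result().config)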

Cheers