PPO not able to learn from an env as simple as Pendulum-v1?

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

I've noticed it is hard to find a full runnable example in the Ray documentation of training on an environment with a continuous action space. I tried running PPO on the simplest such env, Pendulum-v1, but the model doesn't seem to be learning. Is this behavior expected?

My test code, running with Ray==2.1.0 and Python==3.8 on SageMaker:

import ray
from ray import air, tune
# Missing from the original snippet; in Ray 2.1 the MLflow callback lives here:
from ray.air.callbacks.mlflow import MLflowLoggerCallback

tuner = tune.Tuner(
    'PPO',
    tune_config=tune.TuneConfig(
        metric="episode_reward_mean",
        mode="max",
        num_samples=1,
        # max_concurrent_trials=1,
        time_budget_s=60 * 60 * 8,
    ),
    run_config=air.RunConfig(
        # stop={"time_total_s": 60 * 60 * 8},
        checkpoint_config=air.CheckpointConfig(
            checkpoint_at_end=True,
            checkpoint_frequency=200,
            num_to_keep=5,
        ),
        local_dir='/home/ec2-user/SageMaker/ray_results2',
        callbacks=[MLflowLoggerCallback(tracking_uri='http://localhost:5000', experiment_name="dev")],
    ),
    param_space={
        "env": "Pendulum-v1",
        # "clip_actions": True,
        "num_workers": 4,
        "num_gpus": 0,  # number of GPUs to use
        "framework": 'torch',  # tf|tf2|torch
    },
)

tuner.fit()

I also tried Stable Baselines3, which seems to work fine:

from stable_baselines3 import PPO
import gym

model = PPO(
    # 'MultiInputPolicy',
    'MlpPolicy',
    gym.make('Pendulum-v1'),
    gamma=0.98,
    # gSDE settings from https://proceedings.mlr.press/v164/raffin22a.html
    use_sde=True,
    sde_sample_freq=4,
    learning_rate=1e-3,
    verbose=1,
)

# Train the agent
model.learn(total_timesteps=int(1e6))

It achieves a mean_reward of about -200 after running for a few minutes.

Hi,

RLlib does not automatically tune hyperparameters, but hyperparameter tuning is a critical part of reinforcement learning.
If you use Ray Tune to tune some of the more important PPO hyperparameters, you will get good results.
Have a look at our tuned examples if you are looking for such configurations.
For example, here is our tuned example for Pendulum-v1 and PPO.
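
To make that concrete, here is a minimal sketch of what tuning a few of the more impactful PPO hyperparameters with Ray Tune could look like on Pendulum-v1, using the same Ray 2.1 Tuner API as in the question. The search ranges and stopping values below are illustrative assumptions, not the official tuned configuration (see rllib/tuned_examples in the Ray repository for that):

import ray
from ray import air, tune

tuner = tune.Tuner(
    'PPO',
    tune_config=tune.TuneConfig(
        metric="episode_reward_mean",
        mode="max",
        num_samples=8,  # try several hyperparameter combinations
    ),
    run_config=air.RunConfig(
        # Stop each trial once it solves the task or exhausts its budget.
        # These thresholds are illustrative, not from the tuned example.
        stop={"episode_reward_mean": -250, "timesteps_total": 400_000},
    ),
    param_space={
        "env": "Pendulum-v1",
        "framework": "torch",
        "num_workers": 4,
        # Illustrative search space over some of the PPO hyperparameters
        # that tend to matter most on this task:
        "gamma": tune.choice([0.95, 0.99]),
        "lr": tune.loguniform(1e-5, 1e-3),
        "lambda": tune.choice([0.1, 0.95]),
        "train_batch_size": 512,
        "sgd_minibatch_size": 64,
        "num_sgd_iter": 6,
    },
)

results = tuner.fit()
print(results.get_best_result().config)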

Cheers