How severely does this issue affect your experience of using Ray?
- None: Just asking a question out of curiosity
I've observed that it is hard to find, in the Ray documentation, a full runnable example of training on an environment with a continuous action space. I tried running PPO on the simplest such env, Pendulum-v1, but the model doesn't seem to be learning from the environment. Is this behavior expected?
My test code, running with Ray==2.1.0 and Python==3.8 on SageMaker:
import ray
from ray import air, tune
# MLflow logger callback used below; this import path is for Ray 2.1
# (it may differ in other Ray versions).
from ray.air.callbacks.mlflow import MLflowLoggerCallback

tuner = tune.Tuner(
    'PPO',
    tune_config=tune.TuneConfig(
        metric="episode_reward_mean",
        mode="max",
        num_samples=1,
        # max_concurrent_trials=1,
        time_budget_s=60 * 60 * 8,
    ),
    run_config=air.RunConfig(
        # stop={"time_total_s": 60*60*8},
        checkpoint_config=air.CheckpointConfig(
            checkpoint_at_end=True,
            checkpoint_frequency=200,
            num_to_keep=5,
        ),
        local_dir='/home/ec2-user/SageMaker/ray_results2',
        callbacks=[MLflowLoggerCallback(tracking_uri='http://localhost:5000', experiment_name="dev")],
    ),
    param_space={
        "env": "Pendulum-v1",
        # "clip_actions": True,
        "num_workers": 4,  # number of rollout workers
        "num_gpus": 0,  # number of GPUs to use
        "framework": 'torch',  # tf|tf2|torch
    },
)
tuner.fit()
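For completeness, a minimal sketch of how the final reward could be read back from the ResultGrid that Tuner.fit() returns (assuming Ray 2.1's get_best_result API and the "episode_reward_mean" metric key):
results = tuner.fit()  # capture the ResultGrid instead of discarding it (variant of the bare call above)
best_result = results.get_best_result(metric="episode_reward_mean", mode="max")
print("best episode_reward_mean:", best_result.metrics["episode_reward_mean"])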
I also tried Stable Baselines3, which seems to work fine:
from stable_baselines3 import PPO
import gym
model = PPO(
    # 'MultiInputPolicy',
    'MlpPolicy',
    gym.make('Pendulum-v1'),
    gamma=0.98,
    # Using https://proceedings.mlr.press/v164/raffin22a.html
    use_sde=True,
    sde_sample_freq=4,
    learning_rate=1e-3,
    verbose=1,
)
# Train the agent
model.learn(total_timesteps=int(1e6))
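As a quick sanity check, a minimal sketch of evaluating the trained SB3 policy explicitly (assuming evaluate_policy from stable_baselines3.common.evaluation; not part of the training script above):
from stable_baselines3.common.evaluation import evaluate_policy

# Evaluate the trained policy over a few episodes and report mean/std reward.
eval_env = gym.make('Pendulum-v1')
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
print(f"mean_reward={mean_reward:.1f} +/- {std_reward:.1f}")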
During training it achieves a mean_reward of about -200 after running for a few minutes: