I followed the link below to set up the hyperparameters for SAC to solve "CartPole-v0", which should be a very easy task. However, the mean reward always stays around 10. Am I missing something?
https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/sac/cartpole-sac.yaml
import ray
#import ray.rllib.agents.ppo as ppo
import ray.rllib.agents.sac as sac

ray.init()
config = sac.DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
#config["framework"] = "torch"
config["framework"] = "tf"
config["no_done_at_end"] = "false"
config["gamma"] = 0.95
config["target_network_update_freq"] = 32
config["tau"] = 1.0
config["train_batch_size"] = 32
config["optimization"]["actor_learning_rate"] = 0.005
config["optimization"]["critic_learning_rate"] = 0.005
config["optimization"]["entropy_learning_rate"] = 0.0001

#trainer = sac.SACTrainer(config=config, env="MountainCar-v0")
trainer = sac.SACTrainer(config=config, env="CartPole-v0")
for i in range(5000):
    # Perform one iteration of training the policy with SAC
    result = trainer.train()
    if i % 10 == 0:
        #checkpoint = trainer.save()
        print("i: ", i, " reward: ", result["episode_reward_mean"])
Hi, I took a look at your script. A few things:
- You are passing the string "false" to the no_done_at_end parameter, and Python interprets any non-empty string as True. So the environment outputs episodes without the done bit set at the end, which completely confuses the trainer (see the quick check at the end of this post).
- You actually don't need to initialize the config dict from DEFAULT_CONFIG yourself; RLlib will do that for you. E.g., the following script works:
import ray
import ray.rllib.agents.sac as sac

ray.init(local_mode=True)
config = {
    'framework': 'tf',
    'gamma': 0.95,
    'no_done_at_end': False,
    'target_network_update_freq': 32,
    'tau': 1.0,
    'train_batch_size': 32,
    'optimization': {
        'actor_learning_rate': 0.005,
        'critic_learning_rate': 0.005,
        'entropy_learning_rate': 0.0001,
    },
}
trainer = sac.SACTrainer(config=config, env="CartPole-v0")
for i in range(5000):
    result = trainer.train()
    if i % 10 == 0:
        print("i: ", i, result["timesteps_total"], " reward: ", result["episode_reward_mean"])
- You can actually run the yaml file directly using:
rllib train -f rllib/tuned_examples/sac/cartpole-sac.yaml
which saves you the errors that come from copying the configuration over by hand.
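To make the first point concrete, here is a quick plain-Python check of why the string "false" ends up acting like True:

# Any non-empty string is truthy in Python, including "false".
print(bool("false"))  # True
print(bool(""))       # False

# So pass the actual boolean in the config, not a string:
config = {"no_done_at_end": False}      # correct
# config = {"no_done_at_end": "false"}  # wrong: treated as True by the trainer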
It works. Thanks a lot!
However, I failed to get the same hyperparameters to work on "MountainCar-v0". Are there other important hyperparameters in SAC besides the learning rates?
SAC doesn't work particularly well when applied to discrete action spaces. I'd suggest either using another algorithm such as PPO, which can learn the MountainCar-v0 problem with a categorical policy (i.e., discrete outputs), or switching to an environment such as MountainCarContinuous-v0.
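In case it helps, here is a minimal sketch of what switching to PPO could look like, reusing the same old-style trainer API as the scripts above. The config values below are essentially PPO defaults rather than tuned hyperparameters, and MountainCar-v0 is a hard exploration problem, so further tuning would likely still be needed:

import ray
import ray.rllib.agents.ppo as ppo

ray.init()

# Partial config; RLlib merges it with PPO's defaults.
config = {
    "framework": "tf",
    "num_workers": 1,
}
trainer = ppo.PPOTrainer(config=config, env="MountainCar-v0")
for i in range(200):
    result = trainer.train()
    if i % 10 == 0:
        print("i: ", i, " reward: ", result["episode_reward_mean"])

If you would rather stick with SAC, the same training loop with sac.SACTrainer and env="MountainCarContinuous-v0" keeps the action space continuous, which is the setting SAC is designed for.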
Thanks for your suggestion.