the hyperparameters for SAC to solve “CartPole-v0”

I followed the link below to set up the hyperparameters for SAC to solve "CartPole-v0", which is a very easy task. However, the mean reward always remains around 10. Am I missing something?

import ray
#import ray.rllib.agents.ppo as ppo
import ray.rllib.agents.sac as sac

config = sac.DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
#config["framework"] = "torch"
config["framework"] = "tf"
config["no_done_at_end"] = "false"
config["gamma"] = 0.95
config["target_network_update_freq"] = 32
config["tau"] = 1.0
config["train_batch_size"] = 32
config["optimization"]["actor_learning_rate"] = 0.005
config["optimization"]["critic_learning_rate"] = 0.005
config["optimization"]["entropy_learning_rate"] = 0.0001

#trainer = sac.SACTrainer(config=config, env="MountainCar-v0")
trainer = sac.SACTrainer(config=config, env="CartPole-v0")

for i in range(5000):
    # Perform one iteration of training the policy with SAC
    result = trainer.train()
    if i % 10 == 0:
        #checkpoint =
        print("i: ", i, " reward: ", result["episode_reward_mean"])


Hi, I took a look at your script. A few things:

  1. You are passing the string "false" to the parameter no_done_at_end. Python interprets any non-empty string as a truthy value, so the env outputs episodes without the done bit at the end, which completely confuses the trainer.

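The truthiness pitfall is plain Python, nothing RLlib-specific; a quick sketch:

```python
# Any non-empty string is truthy in Python, so the string "false"
# behaves like True when used in a boolean context.
assert bool("false") is True
assert bool("") is False

# Only an actual boolean (or other falsy value) disables the flag.
flag = "false"
if flag:
    print("flag is treated as enabled")  # this branch runs
```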
  2. You actually don't need to init the config dict with DEFAULT_CONFIG yourself; RLlib will do it for you. E.g., the following script works:

import ray
import ray.rllib.agents.sac as sac


config = {
    'framework': 'tf',
    'gamma': 0.95,
    'no_done_at_end': False,
    'target_network_update_freq': 32,
    'tau': 1.0,
    'train_batch_size': 32,
    'optimization': {
        'actor_learning_rate': 0.005,
        'critic_learning_rate': 0.005,
        'entropy_learning_rate': 0.0001,
    },
}

trainer = sac.SACTrainer(config=config, env="CartPole-v0")

for i in range(5000):
    result = trainer.train()
    if i % 10 == 0:
        print("i: ", i, result["timesteps_total"], " reward: ", result["episode_reward_mean"])

  3. You can actually run the tuned-example yaml file directly using:

rllib train -f rllib/tuned_examples/sac/cartpole-sac.yaml

which saves you from errors when copying over the configuration.


It works. Thanks a lot!
However, I failed to apply the same hyperparameters to "MountainCar-v0". Are there other important hyperparameters in SAC besides the learning rates?

SAC doesn't work particularly well on discrete action spaces. I'd suggest using another algorithm such as PPO, which can be used with a categorical policy (discrete outputs) to learn the MountainCar-v0 problem, or instead using an environment such as MountainCarContinuous-v0.
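If you switch to PPO, the analogous config dict might look like the sketch below. The values are illustrative, not a tuned setting (MountainCar-v0 has sparse rewards, so expect to tune further); you would construct the trainer with ppo.PPOTrainer in the same style as the SAC script above.

```python
# Illustrative PPO config for MountainCar-v0 -- not a tuned setting.
# MountainCar-v0 only rewards reaching the goal, so a high gamma
# matters; other values here are placeholders to adjust.
config = {
    'framework': 'tf',
    'gamma': 0.99,
    'lr': 0.0001,
    'num_workers': 1,
}
```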


Thanks for your suggestion.