I followed the following link to set up the hyperparameters for SAC to solve “CartPole-v0”, which is a very easy task. However, the mean episode reward always stays at about 10. Am I missing something?
The code:
import ray
#import ray.rllib.agents.ppo as ppo
import ray.rllib.agents.sac as sac
ray.init()
config = sac.DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
#config["framework"] = "torch"
config["framework"] = "tf"
config["no_done_at_end"] = False
config["gamma"] = 0.95
config["target_network_update_freq"] = 32
config["tau"] = 1.0
config["train_batch_size"] = 32
config["optimization"]["actor_learning_rate"] = 0.005
config["optimization"]["critic_learning_rate"] = 0.005
config["optimization"]["entropy_learning_rate"] = 0.0001
#trainer = sac.SACTrainer(config=config, env="MountainCar-v0")
trainer = sac.SACTrainer(config=config, env="CartPole-v0")
for i in range(5000):
    result = trainer.train()
    if i % 10 == 0:
        #checkpoint = trainer.save()
        print("i: ", i, " reward: ", result['episode_reward_mean'])
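
In case it matters, here is how I understand the tau setting above. This is only a plain-Python sketch of the usual soft (Polyak) target update, not RLlib's actual code; the function name polyak_update and the toy weight lists are made up for illustration. With tau = 1.0 the target network becomes an exact copy of the online network on every update (which, as far as I can tell, happens every target_network_update_freq steps), whereas a small tau only nudges it.

```python
def polyak_update(target, online, tau):
    """Return new target weights: tau * online + (1 - tau) * target.

    Hypothetical helper for illustration; weights are plain lists of floats.
    """
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]


# Toy weights, not taken from any real network.
target_weights = [0.0, 0.0]
online_weights = [1.0, 2.0]

# tau = 1.0 (as in my config): the target is overwritten by the online weights.
hard = polyak_update(target_weights, online_weights, tau=1.0)

# A small tau (0.005 here, chosen only for contrast): the target moves slightly.
soft = polyak_update(target_weights, online_weights, tau=0.005)

print("hard update:", hard)
print("soft update:", soft)
```

So my config should be doing a hard target copy rather than a slow soft update, if I read the parameters correctly.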