Ray Tune: exponential learning rate schedule on HyperOpt

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hello all,

Background: I have successfully used the HyperOpt tuner to tune my PPO hyper-parameters for a project of mine, but only with static learning rates. After many hours of research I have found that not using a learning rate schedule (specifically an exponential one) eventually causes the PPO algorithm's rewards to collapse. When using PPOTrainer for a single hyper-parameter setting, I basically save the checkpoint data every 10 training batches, stop the training process, manually modify the “lr” in the config dict so that lr_new = lr_prev * 0.99, resume training from the last saved checkpoint, and keep doing that for 4000-8000 training batch updates. The problem is that I can’t implement this manual workaround with HyperOpt, because HyperOpt does not run each hyper-parameter setting sequentially, but rather in parallel.
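
For concreteness, here is roughly what that manual loop looks like (environment name and numbers below are just placeholders, and the save/restore details differ a bit between Ray versions):

from ray.rllib.algorithms.ppo import PPOConfig

lr = 1e-5
checkpoint = None
for segment in range(400):  # 400 segments x 10 batches = 4000 updates
    algo = PPOConfig().environment("CartPole-v1").training(lr=lr).build()
    if checkpoint is not None:
        algo.restore(checkpoint)
    for _ in range(10):  # 10 training batches per segment
        algo.train()
    checkpoint = algo.save()  # assumed to return a checkpoint path (varies by Ray version)
    algo.stop()
    lr *= 0.99  # exponential decay applied between segments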

Question: is there a way to define, in the HyperOpt config dict, a tunable exponential learning rate schedule? Essentially the exponential schedule would be defined by 2 parameters (a & b) and a variable N: lr = a * exp(-b * N), where a is the initial lr, b is the rate of decay, and N is episodes/batches/steps etc.
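
In plain Python the schedule I have in mind is simply (numbers purely illustrative):

import math

def exp_lr(a, b, n):
    # learning rate after n batches/steps, given initial lr `a` and decay rate `b`
    return a * math.exp(-b * n)

exp_lr(1e-4, 1e-3, 1000)  # ~3.68e-5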

Below is an example of my current Ray Tune experiment object:

alg = HyperOptSearch(metric="episode_reward_mean", mode="max")
analysis = tune.run(
    "PPO",
    stop={"episodes_total": target},
    metric="episode_reward_mean",
    mode="max",
    config={
        "env": SELECT_ENV,
        "num_workers": 8,
        "lr": tune.uniform(1e-6, 1e-5),
        "gamma": tune.uniform(0.9, 0.99),
        "lambda": tune.uniform(0.9, 0.99),
        "train_batch_size": tune.uniform(2048, 8192),
        "num_gpus": 0,
        "model": {"fcnet_hiddens": tune.choice([[256, 256], [512, 512], [1024, 1024]])},
    },
    search_alg=alg,
    num_samples=20,
)

Thanks in advance

I think the question is more about whether RLlib’s PPO trainer supports taking in a “hyper-parameterized” exponential-decay lr scheduler and, if so, what the expected format is that Tune should give it.
HyperOpt or other searchers are not that relevant. At the end of the day, the decay rate is just treated as one more hyperparameter.
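
In other words, on the Tune side it would just be something like the snippet below; the lr_decay_rate key is made up here, and whether PPO can consume it is exactly the open question:

from ray import tune

search_space = {
    "lr": tune.loguniform(1e-6, 1e-5),             # initial learning rate
    "lr_decay_rate": tune.loguniform(1e-4, 1e-1),  # hypothetical key, not a built-in PPO option
}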

cc @kourosh to comment on PPO Trainer functionality.


Yeah, my bad, it’s less relevant to the search algorithm itself. Overall, I would like my tuner to be able to tune both the decay rate and the initial lr of an exponential decay, not the linear decay that “lr_schedule” apparently does, if I am not mistaken.

Thank you :smiley:

Hey @idaneliash,

My understanding is that you want to implement an lr_schedule inside the PPO algorithm. I looked at the code, and it seems like we currently only support a piecewise-linear learning rate schedule for PPO. I highly encourage you to file a GitHub issue with a feature request (and maybe tag me so I can keep track of it on the future RLlib feature wishlist).
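
For reference, the piecewise-linear schedule that is supported today is a list of [timestep, lr] breakpoints that get linearly interpolated, e.g. (placeholder numbers):

from ray.rllib.algorithms.ppo import PPOConfig

# linear interpolation between the breakpoints; no exponential shape,
# no single tunable decay-rate parameter
config = PPOConfig().training(lr_schedule=[[0, 1e-4], [1_000_000, 1e-5]])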

After digging a little into the LearningRateSchedule class in torch_mixins and the callbacks API, here is my proposal for how you can achieve this manually: introduce a callback that updates the learning rate on the local_worker (which trains the policy) on every iteration.



import math
from ray.rllib.algorithms.callbacks import DefaultCallbacks
from ray.rllib.algorithms.ppo import PPOConfig

class LRDecayCallback(DefaultCallbacks):

    def on_train_result(
        self,
        *,
        algorithm,
        result: dict,
        **kwargs,
    ) -> None:

        iteration = algorithm.iteration
        local_worker = algorithm.workers.local_worker()
        # we will introduce this in a new config
        lr_decay = algorithm.config.lr_decay

        # for torch policy
        for pid in local_worker.policy_map:
            policy = local_worker.policy_map[pid]
            policy.cur_lr = policy.cur_lr * math.exp(-lr_decay * iteration)
            for opt in policy._optimizers:
                for p in opt.param_groups:
                    p["lr"] = policy.cur_lr

class CustomPPOConfig(PPOConfig):
    
    def __init__(self, algo_class=None):
        super().__init__(algo_class)
        self.lr_decay = None

    def training(self, lr_decay=None, **kwargs):
        self.lr_decay = lr_decay
        return super().training(**kwargs)


config = (
    CustomPPOConfig()
    .framework("torch")
    .environment("CartPole-v1")
    .training(lr_decay=0.01, lr=3e-3)
    .callbacks(LRDecayCallback)
)

algo = config.build()

for i in range(10):
    policy = algo.get_policy()
    cur_lr = policy.cur_lr
    print(f"Iteration {i}: lr = {cur_lr}")
    algo.train()

Once this is done, you can apply the regular Tuner.fit() operation to this new config, with Tune search spaces, for hyper-parameter tuning.
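
A rough, untested sketch of that Tuner.fit() step; I am assuming here that you also subclass PPO so the sampled lr_decay key is recognized when the config gets rebuilt from a dict, and the stop criterion and search ranges are just placeholders:

from ray import air, tune
from ray.rllib.algorithms.ppo import PPO

class CustomPPO(PPO):
    # make the trainable use our CustomPPOConfig (so `lr_decay` is a known key)
    @classmethod
    def get_default_config(cls):
        return CustomPPOConfig()

param_space = (
    CustomPPOConfig()
    .framework("torch")
    .environment("CartPole-v1")
    .callbacks(LRDecayCallback)
    .training(
        lr=tune.loguniform(1e-6, 1e-5),        # initial learning rate
        lr_decay=tune.loguniform(1e-4, 1e-1),  # decay rate read by the callback
    )
)

tuner = tune.Tuner(
    CustomPPO,
    param_space=param_space.to_dict(),
    tune_config=tune.TuneConfig(
        metric="episode_reward_mean", mode="max", num_samples=20
    ),
    run_config=air.RunConfig(stop={"training_iteration": 100}),
)
results = tuner.fit()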


Thank you very much!! Unfortunately I am not experienced with torch; I am more experienced with tf2. I guess I’ll try to work with torch, or maybe create something similar with tf2.

@idaneliash Then you would have to replicate what TF’s learning rate scheduler does in tf_mixins: tf_mixins.py - ray-project/ray - Sourcegraph

Something like: optimizer.learning_rate.assign(cur_lr)
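
Roughly like the following, untested; attribute names such as _optimizers can differ between RLlib versions and eager vs. graph mode, so treat this as a guess:

import math
from ray.rllib.algorithms.callbacks import DefaultCallbacks

class TF2LRDecayCallback(DefaultCallbacks):

    def on_train_result(self, *, algorithm, result: dict, **kwargs) -> None:
        lr_decay = algorithm.config.lr_decay
        iteration = algorithm.iteration
        local_worker = algorithm.workers.local_worker()
        for pid, policy in local_worker.policy_map.items():
            new_lr = algorithm.config.lr * math.exp(-lr_decay * iteration)
            # assumed: eager tf2 policies keep their Keras optimizers in `_optimizers`
            for opt in policy._optimizers:
                opt.learning_rate.assign(new_lr)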


Hey again! I’ve successfully implemented the above using the torch framework, but only for training directly, not through Ray Tune yet. On a side note:

policy.cur_lr = policy.cur_lr * math.exp(-lr_decay * iteration)

should actually be:

policy.cur_lr = policy.init_lr * math.exp(-lr_decay * iteration)
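
(init_lr is not a standard Policy attribute; one way to get it is to stash the starting value the first time the callback runs, e.g.:)

# inside on_train_result, before the decay update:
if not hasattr(policy, "init_lr"):
    policy.init_lr = policy.cur_lr  # remember the lr this trial started with
policy.cur_lr = policy.init_lr * math.exp(-lr_decay * iteration)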

Anyhow, I am having trouble defining the param_space for Ray Tune. I’ve tried implementing it in the following manner:

SELECT_ENV = "helienv-v0"
register_env(SELECT_ENV, lambda config: HelicopterEnv())

config = CustomPPOConfig().callbacks(LRDecayCallback).environment("helienv-v0")

config.framework_str = "torch"
config.num_gpus = 0
config.num_rollout_workers = 8

config.lr = tune.loguniform(1e-6, 1e-5)
config.gamma = tune.uniform(0.9, 0.99)
config.lambda_ = tune.uniform(0.9, 0.99)
config.train_batch_size = tune.uniform(2048, 8192)
config.num_gpus = 0
config.model["fcnet_hiddens"] = tune.choice([[256, 256], [512, 512], [1024, 1024]])
config.training(lr_decay=tune.loguniform(1e-4, 1e-1), lr=tune.loguniform(1e-6, 1e-5))

target = 1e5
alg = HyperOptSearch(metric="episode_reward_mean", mode="max")
analysis = tune.run(
    "PPO",
    stop={"episodes_total": target},
    metric="episode_reward_mean",
    mode="max",
    config=config,
    search_alg=alg,
    num_samples=20,
)
print("best hyperparameters: ", analysis.best_config)

But the callback doesn’t know how to handle the inputs to the “training” method when they are tune.loguniform objects, and that’s exactly what I’m trying to accomplish here… For clarification: in your example you wrote “.training(lr_decay=0.01, lr=3e-3)”, which obviously works because you pass floats into the training method, but if I want to optimize the “lr” and “lr_decay” variables, which must go into the training method, how would I do that?