Dynamic Entropy Schedule

I was wondering if there is a way to have a dynamic entropy schedule that does not have to be defined before training starts, but can instead be adjusted at runtime based on arbitrary metrics.

I experimented a bit and came up with the following minimal example script for CartPole-v1 with PPO:

from types import MethodType

import ray
import torch
from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
from ray.rllib.algorithms.callbacks import DefaultCallbacks
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.multi_rl_module import MultiRLModuleSpec
from ray.rllib.core.rl_module.rl_module import RLModule, RLModuleSpec


class EntropyScheduleCallback(DefaultCallbacks):
    def on_train_result(self, *, algorithm, result, **kwargs):
        # Use .get() so the guard below also covers iterations where no
        # episodes have finished yet and the metric is missing.
        mean_reward = result.get("env_runners", {}).get("episode_return_mean")
        if mean_reward is None:
            return

        if mean_reward > 450:
            new_entropy_coeff = 0.01
        elif mean_reward > 300:
            new_entropy_coeff = 0.05
        else:
            new_entropy_coeff = 0.1

        print(f"[Callback] Scheduling entropy coeff = {new_entropy_coeff}")

        # Note: _learner is a private attribute and only refers to the local
        # Learner (it does not reach remote Learner workers, if any).
        learner = algorithm.learner_group._learner
        learner.set_entropy_coeff(new_entropy_coeff)


def custom_learner_factory(
    config: AlgorithmConfig,
    module_spec: RLModuleSpec | MultiRLModuleSpec | None = None,
    module: RLModule | None = None,
):
    learner_cls = config.get_default_learner_class()

    base_learner = learner_cls(config=config, module_spec=module_spec, module=module)

    def set_entropy_coeff(self, new_value: float):
        # "Hijack" each module's entropy coefficient Scheduler by overwriting
        # its (private) current value directly.
        for scheduler in getattr(
            self, "entropy_coeff_schedulers_per_module", {}
        ).values():
            scheduler._curr_value = torch.tensor(float(new_value))

        # Remember the last value we set, mainly for debugging/inspection.
        self._last_set_entropy_coeff = float(new_value)

        return True

    base_learner.set_entropy_coeff = MethodType(set_entropy_coeff, base_learner)

    return base_learner


def main():
    ray.init(runtime_env={"env_vars": {"RAY_DEBUG": "1"}})

    config = (
        PPOConfig()
        .environment(env="CartPole-v1")
        .env_runners(num_env_runners=7)
        .framework("torch")
        .training(learner_class=custom_learner_factory)
        .callbacks(EntropyScheduleCallback)
    )

    algo = config.build()

    for i in range(50):
        result = algo.train()
        print(
            f"Iteration {i + 1}: episode_reward_mean={result['env_runners']['episode_return_mean']}"
        )
        print(
            f"entropy coefficient: {result['learners']['default_policy']['curr_entropy_coeff']}"
        )

    algo.stop()
    ray.shutdown()


if __name__ == "__main__":
    main()

Taking advantage of the new Learner API, I patch an entropy coefficient setter method onto the PPO Learner instance. The dynamic entropy schedule itself is implemented in a callback, which invokes this patched setter. Finally, the actual entropy coefficient is changed in a bit of a hacky way by “hijacking” the entropy scheduler and overwriting its (private) current value directly.
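
One caveat with the callback above: algorithm.learner_group._learner only exists for the local Learner, so the snippet as written will not reach remote Learner workers. Below is a minimal sketch of a variant that broadcasts the call instead; it assumes your Ray version provides LearnerGroup.foreach_learner(), and the class name BroadcastEntropyScheduleCallback is just for illustration.

from ray.rllib.algorithms.callbacks import DefaultCallbacks


class BroadcastEntropyScheduleCallback(DefaultCallbacks):
    def on_train_result(self, *, algorithm, result, **kwargs):
        mean_reward = result.get("env_runners", {}).get("episode_return_mean")
        if mean_reward is None:
            return

        new_entropy_coeff = 0.01 if mean_reward > 450 else 0.1

        # Send the patched setter call to every Learner (local or remote)
        # instead of reaching into the private _learner attribute. The lambda
        # and the captured float are picklable, so this should also work with
        # remote Learner workers.
        algorithm.learner_group.foreach_learner(
            lambda learner: learner.set_entropy_coeff(new_entropy_coeff)
        )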

I am posting this because I think it might be helpful to some, and because I was wondering whether such functionality is planned for future Ray versions. This concept could also be generalized to all kinds of hyperparameters, but I haven’t found a better way than “hijacking” the scheduler so far.
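
For what it is worth, here is a rough sketch of what such a generalization could look like, reusing the same _curr_value hack. The only per-module Scheduler dict I have actually verified is entropy_coeff_schedulers_per_module; any other attribute name passed in would have to be checked against your RLlib version.

import torch


def set_scheduler_value(self, scheduler_dict_name: str, new_value: float) -> bool:
    # Look up a per-module dict of Scheduler objects on the Learner by name,
    # e.g. "entropy_coeff_schedulers_per_module".
    schedulers = getattr(self, scheduler_dict_name, None)
    if not schedulers:
        return False
    for scheduler in schedulers.values():
        # Same hack as above: overwrite the Scheduler's private current value.
        scheduler._curr_value = torch.tensor(float(new_value))
    return True


# Attached to the Learner instance just like set_entropy_coeff in the factory:
# base_learner.set_scheduler_value = MethodType(set_scheduler_value, base_learner)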

I need this functionality because I am doing curriculum learning and it might be beneficial to dynamically adjust the entropy coefficient to get out of local optima during the curriculum progression.

It seems likely that RLlib does not natively support dynamic, runtime adjustment of the entropy coefficient via a public API, but your approach of patching a setter into the Learner and updating the scheduler’s internal value is a practical workaround. The entropy coefficient in PPO is managed by a Scheduler object per module, and the value is typically set at initialization or via a static schedule, not dynamically at runtime based on arbitrary metrics. Your method of directly modifying the scheduler’s internal state is not officially documented or supported, but it aligns with the internal structure of RLlib’s PPO implementation, where entropy coefficients are stored in entropy_coeff_schedulers_per_module and updated after each training step using the scheduler’s update() method, not via an external setter (see here).
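
As a quick sanity check, the overridden value can be read back through the schedulers themselves, e.g. from within the callback right after calling the setter. This is a sketch and assumes the Scheduler class exposes a get_current_value() method in your Ray version.

# Inside on_train_result(), after learner.set_entropy_coeff(new_entropy_coeff):
for module_id, scheduler in learner.entropy_coeff_schedulers_per_module.items():
    # get_current_value() is assumed to return the scheduler's current value.
    print(f"{module_id}: entropy coeff now {scheduler.get_current_value()}")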

There is no mention in the official sources of a built-in, general-purpose API for dynamically adjusting arbitrary hyperparameters (like entropy coefficient) at runtime based on custom metrics, nor is there a public method for setting the entropy coefficient during training. The only officially supported way to schedule such parameters is via static schedules defined before training starts (see here). Your approach is a reasonable hack, but it is not guaranteed to be stable across RLlib versions. If you want this feature to be officially supported, it may be worth filing a feature request with the Ray team (see related discussion).
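
For completeness, the officially supported, statically defined schedule would look roughly like this; the list-of-[timestep, value] format for entropy_coeff is linearly interpolated between the given points, but the exact format may vary between Ray versions, so treat this as a sketch.

from ray.rllib.algorithms.ppo import PPOConfig

# Entropy coefficient schedule fixed before training starts -- which is
# exactly the limitation discussed above.
config = (
    PPOConfig()
    .environment(env="CartPole-v1")
    .training(entropy_coeff=[[0, 0.1], [500_000, 0.01]])
)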

Would you like a step-by-step breakdown of how your workaround interacts with RLlib internals and possible alternatives?

Hi again, is there anyone from the Ray team who can have a look at this? Is this something you are planning to add in the future? Thanks in advance for any replies :slight_smile: