Correct implementation for PPO reset_config()

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi,

After implementing the reset_config() method for PPO and running PB2 with reuse_actors=True on Pendulum-v1, I get this error:

2024-04-05 18:04:17,154	ERROR tune_controller.py:1374 -- Trial task failed for trial PPO_Pendulum-v1_444a6_00014
Traceback (most recent call last):
  File "/home/.conda/envs/.../lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/.conda/envs/.../lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/.conda/envs/.../lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/.conda/envs/.../lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::PPO.reset() (pid=353467, ip=172.21.1.20, actor_id=5317424354362453bd5550a601000000, repr=PPO)
  File "/home/.conda/envs/.../lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 640, in reset
    success = self.reset_config(new_config)
  File "/home/.conda/envs/.../lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 606, in reset_config
    self.config._enable_new_api_stack = new_config._enable_new_api_stack
AttributeError: 'dict' object has no attribute '_enable_new_api_stack'

If I set reuse_actors=False in the Tune config, this error doesn't show up, so I assume my implementation of reset_config() is incorrect? I added it to the PPO(Algorithm) class inside the Ray library in my conda env:

def reset_config(self, new_config):
    self.config = new_config
    return True

Versions / Dependencies

Python: 3.10.14
ray-rllib: 2.9.3

Reproduction script

The PPO class:

class PPO(Algorithm):
    ...
    def reset_config(self, new_config):
        self.config = new_config
        return True

script:

import ray
from ray import train, tune
from ray.tune.tuner import Tuner
from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPO
from ray.tune.schedulers.pb2 import PB2
import argparse
import os
import numpy as np
import pandas as pd

def explore(config):
    if config["train_batch_size"] < config["sgd_minibatch_size"] * 2:
        config["train_batch_size"] = config["sgd_minibatch_size"] * 2
    config["train_batch_size"] = int(config["train_batch_size"])
    return config

pb2 = PB2(
        time_attr="timesteps_total",
        metric="episode_reward_mean",
        mode="max",
        perturbation_interval=50000,
        # Specifies the hyperparam search space
        hyperparam_bounds={
            'gamma': [0.9, 0.9999],
            'lambda_': [0.8, 1.0],
            'kl_coeff': [0.3, 2.0],
            'kl_target': [0.001, 0.1],
            'sgd_minibatch_size': [8, 2048],
            'num_sgd_iter': [1, 20],
            'vf_loss_coeff': [0, 1],
            'entropy_coeff': [1e-8, 0.1],
            'clip_param': [0.1, 0.5],
            'lr': [1e-5, 1],
            'train_batch_size': [2048, 65536],
            'vf_clip_param': [1, 10],
            'grad_clip': [0.3, 5],
        },
        quantile_fraction=0.25,  # bottom 25% of trials copy from the top 25%
        custom_explore_fn=explore,
    )

tuner = Tuner(
    trainable=PPO,
    param_space={
        'env': 'Pendulum-v1',
        'model': {'free_log_std': True},
        'horizon': 100, # length of MDP
        'gamma': tune.choice([0.9, 0.95, 0.98, 0.99, 0.995, 0.999, 0.9999]),
        'lambda_': tune.choice([0.8, 0.9, 0.92, 0.95, 0.98, 0.99, 1.0]),
        'kl_coeff': tune.choice([0.3, 0.5, 1, 1.5, 2.0]),
        'kl_target': tune.choice([0.1, 0.05, 0.03, 0.02, 0.01, 0.005, 0.001]),
        'sgd_minibatch_size': tune.choice([8, 16, 32, 64, 128, 256, 512, 1024, 2048]),
        'num_sgd_iter': tune.choice([1, 5, 10, 20]),
        'vf_loss_coeff': tune.uniform(0, 1),
        'entropy_coeff': tune.loguniform(1e-8, 0.1),
        'clip_param': tune.choice([0.1, 0.2, 0.3, 0.4, 0.5]),
        'lr': tune.loguniform(1e-5, 1),
        'train_batch_size': tune.choice([2048, 4096, 8192, 16384, 32768, 65536]),
        'vf_clip_param': tune.choice([1, 2, 3, 5, 10]),
        'grad_clip': tune.choice([0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 5]),
    },
    tune_config=tune.TuneConfig(
        num_samples=10,  # number of trials
        scheduler=pb2,
        reuse_actors=True,
    ),
    run_config=train.RunConfig(
        name='test',
    ),
)
results = tuner.fit()

print("best hyperparameters: ", results.get_best_result().config)

Bumping this, anyone have any ideas?

While a bit dated, maybe this helps someone.

There are “two kinds” of config and two interfaces, Trainable and Algorithm (which is itself a Trainable). The two types of config are the AlgorithmConfig that you want to work with and the more general trial.config that is sampled by the Tuner and passed to the reset_config method. trial.config is a plain dict, so in your reset_config you receive a dict, not an AlgorithmConfig. Check out Algorithm.__init__ to see how to transform the dict into a valid AlgorithmConfig.
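
For illustration, a minimal, untested sketch of that conversion might look like the following; it assumes the Algorithm.get_default_config() and AlgorithmConfig.update_from_dict() APIs of RLlib ~2.9 and only replaces self.config, which (as noted below) is not sufficient on its own:

def reset_config(self, new_config):
    # new_config is the plain dict sampled by Tune. Rebuild a proper
    # AlgorithmConfig from it, roughly the way Algorithm.__init__ does.
    new_algo_config = self.get_default_config().update_from_dict(new_config)
    # NOTE: replacing self.config does not update the copies held by the
    # Learner / EnvRunners (see below).
    self.config = new_algo_config
    return True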

A cheap reset (probably not recommended) would be:

def reset_config(self, new_config):
    # possibly self.stop() first
    self.__init__(new_config)
    return True

This will recreate all EnvRunners and may recreate some resources without cleaning up the old ones. It will be correct, but not efficient.

Note that self.config = new_created_algo_config is also not enough, as the Learner and EnvRunners hold copies of the original config, so there is quite a bit to take care of.
Depending on your use case, you need to individually address the parts that need resetting: change the env on the EnvRunners, change training parameters on the Learner (or EnvRunners), … (see the sketch below).
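
As a rough, untested illustration of that targeted approach for the old API stack: the sketch below assumes self.workers.foreach_worker() is available and that each RolloutWorker keeps its own config copy (attribute names may differ between RLlib versions):

def reset_config(self, new_config):
    new_algo_config = self.get_default_config().update_from_dict(new_config)
    self.config = new_algo_config

    def _push_to_worker(worker):
        # Assumption: each RolloutWorker holds its own config copy.
        # Refresh it and, if needed, touch the envs or policies as well,
        # e.g. via worker.foreach_env(...).
        worker.config = new_algo_config

    self.workers.foreach_worker(_push_to_worker)
    return True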

When you use PBT, you also need to check that whatever reset_config sets is not overwritten by load_checkpoint.

You’re correct: in Ray RLlib, there are two config types—AlgorithmConfig (the type-safe, validated config you use to build algorithms) and the plain dict config (trial.config) that Ray Tune samples and passes to reset_config. The reset_config(self, new_config) method receives a dict, not an AlgorithmConfig, so you must convert it if you want to use the richer API. The Algorithm.__init__ method shows how to convert a dict config into an AlgorithmConfig using update_from_dict or from_state as appropriate. Simply assigning self.config = new_config is not enough, since subcomponents like Learner and EnvRunners may have their own config copies that also need updating. A full re-init (self.__init__(new_config)) is correct but inefficient, as it may recreate resources unnecessarily; a more targeted approach is to update only the necessary subcomponents.

For efficient and correct behavior, you should update the relevant parts of the algorithm (e.g., call methods to update the environment on EnvRunners, or update training parameters on the Learner) rather than re-initializing the whole object. Also, be aware that in Population Based Training (PBT), reset_config may be called after load_checkpoint, so ensure your logic accounts for this. For more, see the AlgorithmConfig API docs and the Algorithm class source.

Would you like a step-by-step example of how to implement a custom reset_config that updates only specific parts of your RLlib Algorithm?

@RunLLM is reset_config really called after restore and load_checkpoint? Or the other way round? Think before you answer and look for sources.

According to the Ray Tune Trainable implementation, reset_config is called after restore (which internally calls load_checkpoint). Specifically, when a trial is reset (e.g., for PBT or hyperparameter mutation with reuse_actors=True), the sequence is: first, the checkpoint is restored (via restore/load_checkpoint), then reset_config is called to update the config without restarting the process. This order is confirmed in the trainable.py source, where reset calls reset_config after restoring from checkpoint.

Would you like a detailed breakdown of the relevant code flow?

@RunLLM Yes, provide a breakdown. I cannot see this logic in the trainable.py source; is it really handled there, or in other parts of Ray, maybe the TuneController?

It seems likely that the logic for calling reset_config after restore/load_checkpoint is not handled directly in trainable.py, but rather in higher-level orchestration code, such as the TuneController or scheduler logic. In trainable.py, the reset method does call reset_config, but the actual sequence of restore and reset_config during trial resumption or PBT is managed by the trial runner/controller, not the Trainable itself. This is supported by the fact that the orchestration of trial state transitions (including restore and config mutation) is handled in the TuneController and related classes, as seen in TuneController and scheduler code and HyperBand scheduler.

If you want a step-by-step trace of where and how these calls are orchestrated, I can walk through the relevant controller and scheduler code.

So the correct flow is: first reset, then load_checkpoint…

in TuneController:

Reset

  • _maybe_add_actors / _schedule_trial_actor /
    • _maybe_reuse_cached_actor
      • _schedule_trial_reset
        • schedules: reset
        • on_result=_on_trial_reset

Restore

  • _on_trial_reset # ← executed after “reset” returned result
    • _actor_started
      • _schedule_trial_restore
        • schedules restore

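Given that order (reset first, restore afterwards), one way to keep load_checkpoint from clobbering the freshly sampled hyperparameters is to remember them in reset_config and re-apply them after restoring. A hedged sketch only; _pending_tune_config and _apply_tune_config are made-up names, and the re-apply step has to do whatever your reset_config actually does:

# PPO imported from ray.rllib.algorithms.ppo as in the script above.
class MyPPO(PPO):
    def reset_config(self, new_config):
        # Remember the freshly sampled trial config ...
        self._pending_tune_config = dict(new_config)
        self._apply_tune_config(new_config)
        return True

    def load_checkpoint(self, checkpoint):
        super().load_checkpoint(checkpoint)
        # ... and re-apply it after the checkpoint restore has overwritten state.
        pending = getattr(self, "_pending_tune_config", None)
        if pending is not None:
            self._apply_tune_config(pending)

    def _apply_tune_config(self, cfg):
        # Push the sampled hyperparameters wherever they need to go
        # (self.config, Learner, EnvRunners, ...); placeholder here.
        ...
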
Better to check yourself than to rely on the LLM :confused: