I am trying to use ray.tune.Tuner, ray.tune.search.optuna.OptunaSearch, and ray.tune.schedulers.ASHAScheduler with Ray 2.0 to find the hyper-parameters for an RLlib policy that maximize mean reward, while also terminating bad trials early.
The code snippet below shows the current process, but it produces the following error:

ray.tune.error.TuneError: No trial resources are available for launching the actor ray.rllib.evaluation.rollout_worker.RolloutWorker.__init__. To resolve this, specify the Tune option: resources_per_trial=tune.PlacementGroupFactory([{'CPU': 1.0}] + [{'CPU': 1.0}] * N)

The tuning resources documentation ("A Guide To Parallelism and Resources", Ray 2.0.0) provides an example of how to specify resources when passing a trainable to ray.tune.Tuner, but I have not found documentation on how to do this when the trainable is an objective function used with Optuna or another hyper-parameter search algorithm during the tuning process.
# assumed imports for this snippet; the methods below belong to our tuning class
from ray import air, tune
from ray.air import session
from ray.tune.search.optuna import OptunaSearch
from ray.tune.schedulers import ASHAScheduler

def train_policy(self):
    # create hyper-parameter search space
    search_space = self.create_search_space()
    # create search algorithm
    algo = OptunaSearch(
        metric=self.metric,
        mode=self.mode
    )
    # create scheduler that enables aggressive early stopping of bad trials
    scheduler = ASHAScheduler(...)
    # create tuner
    tuner = tune.Tuner(
        # objective function that trains an RLlib PPO policy using hyper-parameters selected by Optuna
        self.objective,
        # specify tune configuration
        tune_config=tune.TuneConfig(
            num_samples=self.num_samples,
            search_alg=algo,
            scheduler=scheduler
        ),
        # specify run configuration
        run_config=air.RunConfig(
            stop=dict(training_iteration=self.num_train_iters),
            verbose=3
        ),
        # specify hyper-parameter search space
        param_space=search_space
    )
    # run tuner
    result_grid = tuner.fit()

def objective(self, config):
    # create PPO trainer
    trainer = self.create_ppo_trainer(config)
    # iterate
    for _ in range(self.num_train_iters):
        # train policy
        results = trainer.train()
        # update tuner
        session.report(dict(
            episode_reward_mean=results[self.metric]
        ))
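Based on the error message and the resources guide, I assume something like the following is what is being suggested, i.e. wrapping the objective with tune.with_resources() and a PlacementGroupFactory before handing it to tune.Tuner, but I have not confirmed that this is the right mechanism for a plain objective function. This is only a sketch; self.num_rollout_workers is a placeholder for however many RolloutWorkers the PPO trainer spins up:

# sketch only: reserve one CPU bundle for the trainer/objective itself plus one
# bundle per RolloutWorker (self.num_rollout_workers is a placeholder attribute)
trainable_with_resources = tune.with_resources(
    self.objective,
    resources=tune.PlacementGroupFactory(
        [{"CPU": 1.0}] + [{"CPU": 1.0}] * self.num_rollout_workers
    ),
)
tuner = tune.Tuner(
    trainable_with_resources,
    tune_config=tune.TuneConfig(
        num_samples=self.num_samples,
        search_alg=algo,
        scheduler=scheduler
    ),
    run_config=air.RunConfig(
        stop=dict(training_iteration=self.num_train_iters),
        verbose=3
    ),
    param_space=search_space
)

Is this the intended way to attach resources to an objective function, or is there a different mechanism when the trainable is driven by a search algorithm?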
In another use case, we would also like to use Optuna to search for hyper-parameters when training an RLlib policy on offline data. To give Optuna feedback about the performance of each policy, we were planning to use the doubly robust estimation method described in the RLlib documentation (ray/rllib-offline.rst at master · ray-project/ray · GitHub). However, this requires that a separate fitted Q-evaluation model, with its own hyper-parameters, be instantiated and trained for each policy during the hyper-parameter tuning process. Is there a way to achieve this without specifying a custom objective function?
Here is a snippet of our current custom objective function:
# assumes the imports from the first snippet, plus numpy
import numpy as np

def objective(self, config):
    # create trainer
    trainer = self.create_marwil_trainer(…)
    # iterate
    for _ in range(self.off_policy_train_info['num_train_iters']):
        # train policy
        trainer.train()
        # evaluate policy using off-policy evaluation
        v_behavior_list, v_target_list = self.evaluate_trainer(trainer)
        # update tuner with estimated policy values
        session.report(dict(
            v_behavior=np.mean(v_behavior_list),
            v_target=np.mean(v_target_list)
        ))
def evaluate_trainer(self, trainer):
    # create doubly robust estimator
    estimator = self.create_doubly_robust_estimator(
        policy=trainer.get_policy(),
    )
    # iterate over batches of train data and train the doubly robust fitted Q-evaluation model
    # evaluate policy using doubly robust method
    # return off-policy estimates
    return v_behavior_list, v_target_list
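For reference, here is roughly what evaluate_trainer() does internally. This is a simplified sketch assuming the Ray 2.0 DoublyRobust / FQETorchModel / JsonReader APIs from the rllib-offline docs; self.offline_data_eval_dir, 'num_eval_batches', and the q_model_config values are placeholders for our actual settings:

from ray.rllib.offline.estimators import DoublyRobust
from ray.rllib.offline.estimators.fqe_torch_model import FQETorchModel
from ray.rllib.offline.json_reader import JsonReader

def evaluate_trainer(self, trainer):
    # doubly robust estimator with its own fitted Q-evaluation (FQE) model
    estimator = DoublyRobust(
        policy=trainer.get_policy(),
        gamma=self.off_policy_train_info['gamma'],
        q_model_config={'type': FQETorchModel, 'n_iters': 160},
    )
    reader = JsonReader(self.offline_data_eval_dir)
    v_behavior_list, v_target_list = [], []
    # iterate over batches of offline data: fit the FQE model, then estimate
    for _ in range(self.off_policy_train_info['num_eval_batches']):
        batch = reader.next()
        estimator.train(batch)                 # trains the FQE q-model on this batch
        estimates = estimator.estimate(batch)  # dict with v_behavior, v_target, ...
        v_behavior_list.append(estimates['v_behavior'])
        v_target_list.append(estimates['v_target'])
    # return off-policy estimates
    return v_behavior_list, v_target_list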
I'm not sure what you're asking, but if the question is how the environment is specified, it is assigned in the method self.create_marwil_trainer(), as shown here:
from ray.rllib.algorithms.marwil import MARWIL, MARWILConfig

def create_marwil_trainer(self, hp_parm_config):
    # create config
    config = (
        MARWILConfig()
        .training(
            gamma=self.off_policy_train_info['gamma'],
            beta=hp_parm_config['beta'],
            lr=hp_parm_config['lr'])
        .environment(env=self.env_name)
        .framework('torch')
        .offline_data(input_=self.offline_data_train_dir)
    ).to_dict()
    # set fixed values
    config['horizon'] = self.num_time_periods
    ...
    # set hyper-parameter values selected by the tuning process
    config['observation_filter'] = hp_parm_config['observation_filter']
    ...
    # create trainer
    return MARWIL(config=config)
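Related to the question about avoiding a custom objective function: the same rllib-offline docs also describe attaching off-policy estimators directly to the algorithm via evaluation(off_policy_estimation_methods=...), so that the doubly robust / FQE estimates appear in the regular evaluation results rather than being computed by hand. Below is a sketch of what that might look like for the MARWIL config above, assuming the Ray 2.0 evaluation() API; the evaluation settings, self.offline_data_eval_dir, and the q_model_config values are placeholder assumptions:

# sketch: configure doubly robust OPE on the MARWIL config itself (placeholder values)
from ray.rllib.algorithms.marwil import MARWILConfig
from ray.rllib.offline.estimators import DoublyRobust
from ray.rllib.offline.estimators.fqe_torch_model import FQETorchModel

config = (
    MARWILConfig()
    .training(
        gamma=self.off_policy_train_info['gamma'],
        beta=hp_parm_config['beta'],
        lr=hp_parm_config['lr'])
    .environment(env=self.env_name)
    .framework('torch')
    .offline_data(input_=self.offline_data_train_dir)
    .evaluation(
        evaluation_interval=1,
        # evaluate on a held-out offline dataset instead of the environment
        evaluation_config={'input': self.offline_data_eval_dir},
        # attach a doubly robust estimator (with its own FQE model) to evaluation
        off_policy_estimation_methods={
            'dr_fqe': {
                'type': DoublyRobust,
                'q_model_config': {'type': FQETorchModel, 'n_iters': 160},
            },
        },
    )
)

If that works, the per-iteration results returned by trainer.train() should already contain the estimator's v_behavior / v_target values under the evaluation metrics, which a plain Tune trainable could report without the custom evaluate_trainer() loop, though we have not verified this end to end.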