Specify trial resources when using Optuna search algorithm to tune hyper-parameters

I am trying to use ray.tune.Tuner, ray.tune.search.optuna.OptunaSearch, and ray.tune.schedulers.ASHAScheduler with Ray 2 to find the hyper-parameters for an RLlib policy that maximize mean reward, while also terminating bad trials early.

The code snippet below shows the current approach, which generates the following error:

ray.tune.error.TuneError: No trial resources are available for launching the actor ray.rllib.evaluation.rollout_worker.RolloutWorker.__init__. To resolve this, specify the Tune option: resources_per_trial=tune.PlacementGroupFactory([{'CPU': 1.0}] + [{'CPU': 1.0}] * N)

The tuning resources documentation (A Guide To Parallelism and Resources — Ray 2.0.0) provides an example of how to specify resources when passing a trainable to ray.tune.Tuner, but I haven’t found documentation for how to do this when passing an objective function that is driven by Optuna or another hyper-parameter search algorithm.

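from ray import air, tune
from ray.air import session
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch

def train_policy(self):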

  # create hyper-parameter search space 
  search_space = self.create_search_space() 
  
  # create search algorithm 
  algo = OptunaSearch( 
      metric=self.metric, 
      mode=self.mode 
  ) 
  
  # create scheduler that enables aggressive early stopping of bad trials 
  scheduler = ASHAScheduler(...) 
  
  # create tuner 
  tuner = tune.Tuner( 
  
      # objective function that trains RLLib PPO policy using hyper-parameters selected by Optuna 
      self.objective, 
  
      # specify tune configuration 
      tune_config=tune.TuneConfig( 
          num_samples=self.num_samples, 
          search_alg=algo, 
          scheduler=scheduler 
      ), 
  
      # specify run configuration 
      run_config=air.RunConfig( 
          stop=dict(training_iteration=self.num_train_iters), 
          verbose=3 
      ), 
  
      # specify hyper-parameter search space 
      param_space=search_space 
  ) 
  
  # run tuner 
  result_grid = tuner.fit() 

def objective(self, config):

  # create PPO trainer 
  trainer = self.create_ppo_trainer(config) 

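  # iterate (the loop index is unused, so avoid shadowing the built-in iter)
  for _ in range(self.num_train_iters):

      # train policy
      results = trainer.train()

      # update tuner; report under the same key that OptunaSearch was given as its metric
      session.report({
          self.metric: results[self.metric]
      })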

Greatly appreciate any help.

Thanks,
Stefan

Hi Stefan,

Ray RLlib is fully integrated with Ray Tune - you don’t have to create your own objective function here.

E.g. this should work:

  tuner = tune.Tuner(

      # registered RLlib algorithm name, used instead of a custom objective function
      "PPO",

      # specify tune configuration
      tune_config=tune.TuneConfig(
          num_samples=self.num_samples,
          search_alg=algo,
          scheduler=scheduler
      ),

      # specify run configuration
      run_config=air.RunConfig(
          stop=dict(training_iteration=self.num_train_iters),
          verbose=3
      ),

      # specify hyper-parameter search space
      param_space=search_space
  )

This will automatically set all the required resources.

Do you do anything special in self.create_ppo_trainer?
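
If the custom objective function does need to stay, the resource request suggested in the error message can be attached to it with tune.with_resources. The following is only a minimal sketch under some assumptions: trial_bundles and with_trial_resources are hypothetical helper names, each rollout worker is assumed to need one CPU, and num_rollout_workers has to match the number of workers that the PPO config inside the objective actually creates.

```python
def trial_bundles(num_rollout_workers, cpus_per_worker=1.0):
    """Build placement-group bundles: one for the trainer, one per rollout worker.

    Mirrors the shape from the error message:
    [{'CPU': 1.0}] + [{'CPU': 1.0}] * N
    """
    if num_rollout_workers < 0:
        raise ValueError("num_rollout_workers must be non-negative")
    trainer_bundle = [{"CPU": 1.0}]
    worker_bundles = [
        {"CPU": float(cpus_per_worker)} for _ in range(num_rollout_workers)
    ]
    return trainer_bundle + worker_bundles


def with_trial_resources(objective, num_rollout_workers):
    """Attach the resource request to a function trainable (requires Ray)."""
    # Imported lazily so that trial_bundles() can be used without Ray installed.
    from ray import tune

    # Wrap the objective in a plain function so that bound methods such as
    # self.objective are treated like module-level function trainables.
    def trainable(config):
        return objective(config)

    return tune.with_resources(
        trainable,
        tune.PlacementGroupFactory(trial_bundles(num_rollout_workers)),
    )
```

The wrapped objective would then be passed to tune.Tuner in place of self.objective, for example tune.Tuner(with_trial_resources(self.objective, num_rollout_workers=2), ...), with the remaining arguments unchanged.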

Hi Kai.

Thank you, that works great.

In another use case, we’d also like to use Optuna to search for hyper-parameters when training an RLlib policy on offline data. To give Optuna feedback about the performance of each policy, we were planning to use the doubly robust estimation method described in the RLlib documentation (ray/rllib-offline.rst at master · ray-project/ray · GitHub). However, this requires that a separate fitted Q-evaluation model, with its own hyper-parameters, is instantiated and trained for each policy during the hyper-parameter tuning process. Is there a way to achieve this without specifying a custom objective function?

Here is a snippet of our current custom objective function:

def objective(self, config):

  # create trainer
  trainer = self.create_marwil_trainer(...)

  # iterate
  for _ in range(self.off_policy_train_info['num_train_iters']):

      # train policy
      trainer.train()

      # evaluate policy using off-policy evaluation
      v_behavior_list, v_target_list = self.evaluate_trainer(trainer)

      # update tuner with estimated policy values
      session.report(dict(
          v_behavior=np.mean(v_behavior_list),
          v_target=np.mean(v_target_list)
      ))

def evaluate_trainer(self, trainer):

  # create doubly robust estimator
  estimator = self.create_doubly_robust_estimator(
      policy=trainer.get_policy(),
  )

  # iterate over batches of train data and train doubly robust fitted q-evaluation model

  # evaluate policy using doubly robust method

  # return off-policy estimates
  return v_behavior_list, v_target_list
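
As a rough sketch of the reporting step above, the two lists of per-batch estimates could be reduced to the dictionary that is handed to session.report. The helper name summarize_estimates is hypothetical, and statistics.fmean stands in for np.mean only to keep the example free of dependencies. Whichever metric name is given to OptunaSearch (for example "v_target" with mode "max") has to be one of the keys reported here.

```python
from statistics import fmean


def summarize_estimates(v_behavior_list, v_target_list):
    """Reduce per-batch off-policy estimates to the metrics reported to Tune."""
    if not v_behavior_list or not v_target_list:
        raise ValueError("both estimate lists must be non-empty")
    return {
        # mean estimated value of the behavior policy that produced the data
        "v_behavior": fmean(v_behavior_list),
        # mean estimated value of the policy currently being trained
        "v_target": fmean(v_target_list),
    }
```

Inside the training loop this would be used as session.report(summarize_estimates(v_behavior_list, v_target_list)).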

Thanks for your help,
Stefan