Evaluate a model after a hyperparameter search algorithm

How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.

Hi,

I recently worked on hyperparameter optimization with a search algorithm.

The purpose is to train an agent in an OpenAI Gym environment.

The problem is the following: when I run a hyperparameter optimization with the HyperOpt algorithm from ray.tune, it returns a best config in which several parameters appear two or three times. Moreover, I cannot use this best configuration to run a single (unit) training, so I deduced there was a problem.

My code is shown below:

config = {
             "env": "LunarLander-v2",
             "sgd_minibatch_size": 1000,
             "num_sgd_iter": 1000,
             "lr": tune.uniform(5e-6, 5e-2),
             "lambda": tune.uniform(0.6, 0.9),
             "vf_loss_coeff": 0.7,
             "kl_target": 0.01,
             "kl_coeff": tune.uniform(0.5, 0.9),
             "entropy_coeff": 0.001,
             "clip_param": tune.uniform(0.4, 0.99),
             "train_batch_size": 25000, # taille de l'épisode
             # "monitor": True,
             # "model": {"free_log_std": True},
             "num_workers": 4,
             "num_gpus": 0,
             # "rollout_fragment_length":3
             # "batch_mode": "complete_episodes"
         }


config = explore(config)
optimizer = HyperOptSearch(metric="episode_reward_mean", mode="max", n_initial_points=1, random_state_seed=7, space=config)
tuner = tune.Tuner(
    "PPO",
    tune_config=tune.TuneConfig(
        metric="episode_reward_mean",  # the metric we want to study
        mode="max",  # maximize the metric
        search_alg=optimizer,
        # num_samples will repeat the entire config 'num_samples' times == number of trials in the 'Status' output
        num_samples=1,
    ),
    run_config=air.RunConfig(stop={"training_iteration": 1}),
    # limits the number of episodes for each hyperparameter combination

)
results = tuner.fit()

best_conf = results.get_best_result().config

print(f"\n ##############################################\n Best configuration: {best_conf}\n ##############################################\n")

The problem is that I cannot use best_conf to run a simple unit training (and eventually call compute_single_action() on my agent and visualize the render). The format of the best_conf dictionary is not what I expect, because several parameters appear two or three times.
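Concretely, here is the kind of unit training I am trying to run with best_conf (a rough sketch on my side; the Ray 2.x import path for PPO is an assumption):

from ray.rllib.algorithms.ppo import PPO as PPOTrainer  # assumed Ray 2.x import path

best_conf = results.get_best_result().config    # expected: a usable RLlib config dict
rllib_trainer = PPOTrainer(config=best_conf)    # this is where it currently fails
result = rllib_trainer.train()                  # one training iteration
print(result["episode_reward_mean"])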

I found this in the ray.tune documentation, but I don’t understand how to adapt it to my case (i.e., what is my model?). The link: Getting Started — Ray 2.2.0

The code :

import os
import torch  # torch is needed for torch.load() below

logdir = results.get_best_result("mean_accuracy", mode="max").log_dir
state_dict = torch.load(os.path.join(logdir, "model.pth"))

model = ConvNet()  # ConvNet is the model class defined earlier in that tutorial
model.load_state_dict(state_dict)
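If I understand correctly, in my RLlib case there is no model.pth to load and the "model" lives inside the trained policy, so my best guess at the RLlib analogue of the snippet above would be something like this (only a sketch under my own assumptions; to_directory() and get_policy() are how I think the Ray 2.x APIs work):

best_checkpoint = results.get_best_result().checkpoint   # AIR Checkpoint of the best trial
checkpoint_dir = best_checkpoint.to_directory()          # materialize it as a local directory
rllib_trainer = PPOTrainer(config=best_conf)             # still needs a usable config here
rllib_trainer.restore(checkpoint_dir)                    # load the trained weights
policy = rllib_trainer.get_policy()                      # the policy wraps the underlying model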

Maybe my strategy of getting best_conf and using it for a unit training is not the right one. I’m open to your ideas here :slight_smile:

Thank you in advance.

@arturn do you happen to know if this behavior is caused by RLlib? AFAIK get_best_result().config shouldn’t contain duplicates.

Hi @clement2802 ,

The config should indeed not contain duplicates.
Could you please post the error you are receiving?
Where does it occur?

Hi @arturn ,

here is the exception I get when I call PPOTrainer(best_conf), where best_conf is the best configuration from my HyperOptSearch algorithm:

Traceback (most recent call last):
  File "/home/cytech/Python/Nav/affecta/LunarLander_tune.PY", line 340, in <module>
    unit_ppo(best_conf, 5, checkpoint)
  File "/home/cytech/Python/Nav/affecta/LunarLander_tune.PY", line 91, in unit_ppo
    rllib_trainer = PPOTrainer(config=config)
  File "/home/cytech/anaconda3/envs/IA2/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 308, in __init__
    super().__init__(config=config, logger_creator=logger_creator, **kwargs)
  File "/home/cytech/anaconda3/envs/IA2/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 157, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/cytech/anaconda3/envs/IA2/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 477, in setup
    assert (
AssertionError

It seems that the PPOTrainer in my unit_ppo method cannot use best_conf.

I don’t know if this could help us, but here is the best_conf with the duplicated params:

Best configuration : {‘extra_python_environs_for_driver’: {}, ‘extra_python_environs_for_worker’: {}, ‘num_gpus’: 0, ‘num_cpus_per_worker’: 1, ‘num_gpus_per_worker’: 0, ‘_fake_gpus’: False, ‘custom_resources_per_worker’: {}, ‘placement_strategy’: ‘PACK’, ‘eager_tracing’: False, ‘eager_max_retraces’: 20, ‘tf_session_args’: {‘intra_op_parallelism_threads’: 2, ‘inter_op_parallelism_threads’: 2, ‘gpu_options’: {‘allow_growth’: True}, ‘log_device_placement’: False, ‘device_count’: {‘CPU’: 1}, ‘allow_soft_placement’: True}, ‘local_tf_session_args’: {‘intra_op_parallelism_threads’: 8, ‘inter_op_parallelism_threads’: 8}, ‘env’: ‘LunarLander-v2’, ‘env_config’: {}, ‘observation_space’: None, ‘action_space’: None, ‘env_task_fn’: None, ‘render_env’: False, ‘clip_rewards’: None, ‘normalize_actions’: True, ‘clip_actions’: False, ‘disable_env_checking’: False, ‘num_workers’: 4, ‘num_envs_per_worker’: 1, ‘sample_collector’: <class ‘ray.rllib.evaluation.collectors.simple_list_collector.SimpleListCollector’>, ‘sample_async’: False, ‘enable_connectors’: False, ‘rollout_fragment_length’: 6250, ‘batch_mode’: ‘truncate_episodes’, ‘remote_worker_envs’: False, ‘remote_env_batch_wait_ms’: 0, ‘validate_workers_after_construction’: True, ‘ignore_worker_failures’: False, ‘recreate_failed_workers’: False, ‘restart_failed_sub_environments’: False, ‘num_consecutive_worker_failures_tolerance’: 100, ‘horizon’: None, ‘soft_horizon’: False, ‘no_done_at_end’: False, ‘preprocessor_pref’: ‘deepmind’, ‘observation_filter’: ‘NoFilter’, ‘synchronize_filters’: True, ‘compress_observations’: False, ‘enable_tf1_exec_eagerly’: False, ‘sampler_perf_stats_ema_coef’: None, ‘gamma’: 0.99, ‘lr’: 0.03346975115973727, ‘train_batch_size’: 25000, ‘model’: {‘_use_default_native_models’: False, ‘_disable_preprocessor_api’: False, ‘_disable_action_flattening’: False, ‘fcnet_hiddens’: [256, 256], ‘fcnet_activation’: ‘tanh’, ‘conv_filters’: None, ‘conv_activation’: ‘relu’, ‘post_fcnet_hiddens’: , ‘post_fcnet_activation’: ‘relu’, ‘free_log_std’: False, ‘no_final_linear’: False, ‘vf_share_layers’: False, ‘use_lstm’: False, ‘max_seq_len’: 20, ‘lstm_cell_size’: 256, ‘lstm_use_prev_action’: False, ‘lstm_use_prev_reward’: False, ‘_time_major’: False, ‘use_attention’: False, ‘attention_num_transformer_units’: 1, ‘attention_dim’: 64, ‘attention_num_heads’: 1, ‘attention_head_dim’: 32, ‘attention_memory_inference’: 50, ‘attention_memory_training’: 50, ‘attention_position_wise_mlp_dim’: 32, ‘attention_init_gru_gate_bias’: 2.0, ‘attention_use_n_prev_actions’: 0, ‘attention_use_n_prev_rewards’: 0, ‘framestack’: True, ‘dim’: 84, ‘grayscale’: False, ‘zero_mean’: True, ‘custom_model’: None, ‘custom_model_config’: {}, ‘custom_action_dist’: None, ‘custom_preprocessor’: None, ‘lstm_use_prev_action_reward’: -1}, ‘optimizer’: {}, ‘explore’: True, ‘exploration_config’: {‘type’: ‘StochasticSampling’}, ‘input_config’: {}, ‘actions_in_input_normalized’: False, ‘postprocess_inputs’: False, ‘shuffle_buffer_size’: 0, ‘output’: None, ‘output_config’: {}, ‘output_compress_columns’: [‘obs’, ‘new_obs’], ‘output_max_file_size’: 67108864, ‘evaluation_interval’: None, ‘evaluation_duration’: 10, ‘evaluation_duration_unit’: ‘episodes’, ‘evaluation_sample_timeout_s’: 180.0, ‘evaluation_parallel_to_training’: False, ‘evaluation_config’: {‘extra_python_environs_for_driver’: {}, ‘extra_python_environs_for_worker’: {}, ‘num_gpus’: 0, ‘num_cpus_per_worker’: 1, ‘num_gpus_per_worker’: 0, ‘_fake_gpus’: False, ‘custom_resources_per_worker’: {}, ‘placement_strategy’: ‘PACK’, ‘eager_tracing’: 
False, ‘eager_max_retraces’: 20, ‘tf_session_args’: {‘intra_op_parallelism_threads’: 2, ‘inter_op_parallelism_threads’: 2, ‘gpu_options’: {‘allow_growth’: True}, ‘log_device_placement’: False, ‘device_count’: {‘CPU’: 1}, ‘allow_soft_placement’: True}, ‘local_tf_session_args’: {‘intra_op_parallelism_threads’: 8, ‘inter_op_parallelism_threads’: 8}, ‘env’: ‘LunarLander-v2’, ‘env_config’: {}, ‘observation_space’: None, ‘action_space’: None, ‘env_task_fn’: None, ‘render_env’: False, ‘clip_rewards’: None, ‘normalize_actions’: True, ‘clip_actions’: False, ‘disable_env_checking’: False, ‘num_workers’: 4, ‘num_envs_per_worker’: 1, ‘sample_collector’: <class ‘ray.rllib.evaluation.collectors.simple_list_collector.SimpleListCollector’>, ‘sample_async’: False, ‘enable_connectors’: False, ‘rollout_fragment_length’: 6250, ‘batch_mode’: ‘truncate_episodes’, ‘remote_worker_envs’: False, ‘remote_env_batch_wait_ms’: 0, ‘validate_workers_after_construction’: True, ‘ignore_worker_failures’: False, ‘recreate_failed_workers’: False, ‘restart_failed_sub_environments’: False, ‘num_consecutive_worker_failures_tolerance’: 100, ‘horizon’: None, ‘soft_horizon’: False, ‘no_done_at_end’: False, ‘preprocessor_pref’: ‘deepmind’, ‘observation_filter’: ‘NoFilter’, ‘synchronize_filters’: True, ‘compress_observations’: False, ‘enable_tf1_exec_eagerly’: False, ‘sampler_perf_stats_ema_coef’: None, ‘gamma’: 0.99, ‘lr’: 0.03346975115973727, ‘train_batch_size’: 25000, ‘model’: {‘_use_default_native_models’: False, ‘_disable_preprocessor_api’: False, ‘_disable_action_flattening’: False, ‘fcnet_hiddens’: [256, 256], ‘fcnet_activation’: ‘tanh’, ‘conv_filters’: None, ‘conv_activation’: ‘relu’, ‘post_fcnet_hiddens’: , ‘post_fcnet_activation’: ‘relu’, ‘free_log_std’: False, ‘no_final_linear’: False, ‘vf_share_layers’: False, ‘use_lstm’: False, ‘max_seq_len’: 20, ‘lstm_cell_size’: 256, ‘lstm_use_prev_action’: False, ‘lstm_use_prev_reward’: False, ‘_time_major’: False, ‘use_attention’: False, ‘attention_num_transformer_units’: 1, ‘attention_dim’: 64, ‘attention_num_heads’: 1, ‘attention_head_dim’: 32, ‘attention_memory_inference’: 50, ‘attention_memory_training’: 50, ‘attention_position_wise_mlp_dim’: 32, ‘attention_init_gru_gate_bias’: 2.0, ‘attention_use_n_prev_actions’: 0, ‘attention_use_n_prev_rewards’: 0, ‘framestack’: True, ‘dim’: 84, ‘grayscale’: False, ‘zero_mean’: True, ‘custom_model’: None, ‘custom_model_config’: {}, ‘custom_action_dist’: None, ‘custom_preprocessor’: None, ‘lstm_use_prev_action_reward’: -1}, ‘optimizer’: {}, ‘explore’: True, ‘exploration_config’: {‘type’: ‘StochasticSampling’}, ‘input_config’: {}, ‘actions_in_input_normalized’: False, ‘postprocess_inputs’: False, ‘shuffle_buffer_size’: 0, ‘output’: None, ‘output_config’: {}, ‘output_compress_columns’: [‘obs’, ‘new_obs’], ‘output_max_file_size’: 67108864, ‘evaluation_interval’: None, ‘evaluation_duration’: 10, ‘evaluation_duration_unit’: ‘episodes’, ‘evaluation_sample_timeout_s’: 180.0, ‘evaluation_parallel_to_training’: False, ‘evaluation_config’: {}, ‘off_policy_estimation_methods’: {}, ‘evaluation_num_workers’: 0, ‘always_attach_evaluation_results’: False, ‘in_evaluation’: False, ‘sync_filters_on_rollout_workers_timeout_s’: 60.0, ‘keep_per_episode_custom_metrics’: False, ‘metrics_episode_collection_timeout_s’: 60.0, ‘metrics_num_episodes_for_smoothing’: 100, ‘min_time_s_per_iteration’: None, ‘min_train_timesteps_per_iteration’: 0, ‘min_sample_timesteps_per_iteration’: 0, ‘logger_creator’: None, ‘logger_config’: None, ‘log_level’: ‘WARN’, ‘log_sys_usage’: True, 
‘fake_sampler’: False, ‘seed’: None, ‘_tf_policy_handles_more_than_one_loss’: False, ‘_disable_preprocessor_api’: False, ‘_disable_action_flattening’: False, ‘_disable_execution_plan_api’: True, ‘simple_optimizer’: False, ‘monitor’: -1, ‘evaluation_num_episodes’: -1, ‘metrics_smoothing_episodes’: -1, ‘timesteps_per_iteration’: -1, ‘min_iter_time_s’: -1, ‘collect_metrics_timeout’: -1, ‘buffer_size’: -1, ‘prioritized_replay’: -1, ‘learning_starts’: -1, ‘replay_batch_size’: -1, ‘replay_sequence_length’: None, ‘prioritized_replay_alpha’: -1, ‘prioritized_replay_beta’: -1, ‘prioritized_replay_eps’: -1, ‘min_time_s_per_reporting’: -1, ‘min_train_timesteps_per_reporting’: -1, ‘min_sample_timesteps_per_reporting’: -1, ‘input_evaluation’: -1, ‘lr_schedule’: None, ‘use_critic’: True, ‘use_gae’: True, ‘kl_coeff’: 0.5003002941138288, ‘sgd_minibatch_size’: 1000, ‘num_sgd_iter’: 1000, ‘shuffle_sequences’: True, ‘vf_loss_coeff’: 0.7, ‘entropy_coeff’: 0.001, ‘entropy_coeff_schedule’: None, ‘clip_param’: 0.9429343265857039, ‘vf_clip_param’: 10.0, ‘grad_clip’: None, ‘kl_target’: 0.01, ‘vf_share_layers’: -1, ‘lambda’: 0.7125712711928637, ‘input’: ‘sampler’, ‘multiagent’: {‘policies’: {‘default_policy’: <ray.rllib.policy.policy.PolicySpec object at 0x7fe03aa32bc0>}, ‘policy_map_capacity’: 100, ‘policy_map_cache’: None, ‘policy_mapping_fn’: None, ‘policies_to_train’: None, ‘observation_fn’: None, ‘replay_mode’: ‘independent’, ‘count_steps_by’: ‘env_steps’}, ‘callbacks’: <class ‘ray.rllib.algorithms.callbacks.DefaultCallbacks’>, ‘create_env_on_driver’: False, ‘custom_eval_function’: None, ‘framework’: ‘tf’, ‘num_cpus_for_driver’: 1}, ‘off_policy_estimation_methods’: {}, ‘evaluation_num_workers’: 0, ‘always_attach_evaluation_results’: False, ‘in_evaluation’: False, ‘sync_filters_on_rollout_workers_timeout_s’: 60.0, ‘keep_per_episode_custom_metrics’: False, ‘metrics_episode_collection_timeout_s’: 60.0, ‘metrics_num_episodes_for_smoothing’: 100, ‘min_time_s_per_iteration’: None, ‘min_train_timesteps_per_iteration’: 0, ‘min_sample_timesteps_per_iteration’: 0, ‘logger_creator’: None, ‘logger_config’: None, ‘log_level’: ‘WARN’, ‘log_sys_usage’: True, ‘fake_sampler’: False, ‘seed’: None, ‘_tf_policy_handles_more_than_one_loss’: False, ‘_disable_preprocessor_api’: False, ‘_disable_action_flattening’: False, ‘_disable_execution_plan_api’: True, ‘simple_optimizer’: False, ‘monitor’: -1, ‘evaluation_num_episodes’: -1, ‘metrics_smoothing_episodes’: -1, ‘timesteps_per_iteration’: -1, ‘min_iter_time_s’: -1, ‘collect_metrics_timeout’: -1, ‘buffer_size’: -1, ‘prioritized_replay’: -1, ‘learning_starts’: -1, ‘replay_batch_size’: -1, ‘replay_sequence_length’: None, ‘prioritized_replay_alpha’: -1, ‘prioritized_replay_beta’: -1, ‘prioritized_replay_eps’: -1, ‘min_time_s_per_reporting’: -1, ‘min_train_timesteps_per_reporting’: -1, ‘min_sample_timesteps_per_reporting’: -1, ‘input_evaluation’: -1, ‘lr_schedule’: None, ‘use_critic’: True, ‘use_gae’: True, ‘kl_coeff’: 0.5003002941138288, ‘sgd_minibatch_size’: 1000, ‘num_sgd_iter’: 1000, ‘shuffle_sequences’: True, ‘vf_loss_coeff’: 0.7, ‘entropy_coeff’: 0.001, ‘entropy_coeff_schedule’: None, ‘clip_param’: 0.9429343265857039, ‘vf_clip_param’: 10.0, ‘grad_clip’: None, ‘kl_target’: 0.01, ‘vf_share_layers’: -1, ‘lambda’: 0.7125712711928637, ‘input’: ‘sampler’, ‘multiagent’: {‘policies’: {‘default_policy’: <ray.rllib.policy.policy.PolicySpec object at 0x7fe03aa32d70>}, ‘policy_map_capacity’: 100, ‘policy_map_cache’: None, ‘policy_mapping_fn’: None, ‘policies_to_train’: None, ‘observation_fn’: 
None, ‘replay_mode’: ‘independent’, ‘count_steps_by’: ‘env_steps’}, ‘callbacks’: <class ‘ray.rllib.algorithms.callbacks.DefaultCallbacks’>, ‘create_env_on_driver’: False, ‘custom_eval_function’: None, ‘framework’: ‘tf’, ‘num_cpus_for_driver’: 1}

And I would like to check something: I need the checkpoint of my best hyperparameter configuration (get_best_result().checkpoint), but also best_conf itself in order to create a new PPOTrainer(config=best_conf) and compute a single action in my gym environment, is that right? Do I need both the checkpoint and best_conf?
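For context, this is the kind of evaluation loop I would like to run once the agent is restored (a rough sketch on my side, assuming the older gym step/reset API that Ray 2.x still uses):

import gym

env = gym.make("LunarLander-v2")
obs = env.reset()                  # old gym API: reset() returns only the observation
done = False
while not done:
    action = rllib_trainer.compute_single_action(obs)   # ask the trained policy for an action
    obs, reward, done, info = env.step(action)
    env.render()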

@clement2802, could you perhaps print the config with the duplicates here, or link us to a gist?

My first thought was that there might be several trials tied at the maximum episode_reward_mean, but the get_best_result() function does not permit that.

Hi,

I upgraded Ray (2.0.0 → 2.2.0); now I get the correct configuration and I can use it to train a new agent.

But now I have another question:

If I run a HyperOpt search algorithm for a few days, I will get a (best) checkpoint that I can reuse later, but how can I extract best_conf from that checkpoint? I guess best_conf is contained inside the checkpoint, right?
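Here is roughly what I am hoping for (just a sketch of my idea, assuming the Algorithm.from_checkpoint() API that appeared around Ray 2.1, if I understand it correctly; get_config() is also an assumption on my side):

from ray.rllib.algorithms.algorithm import Algorithm

algo = Algorithm.from_checkpoint(checkpoint)   # checkpoint = path or AIR Checkpoint from the search
best_conf = algo.get_config()                  # recover the config stored with the checkpoint
algo.train()                                   # continue training the already-trained agent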

I need best_conf to create a new PPOTrainer and continue training an agent that is already trained, by restoring the checkpoint:

rllib_trainer = PPOTrainer(config)   # config = best_conf from the search
rllib_trainer.restore(checkpoint)    # load the weights saved at the checkpoint

# training
N = nb_episode                       # nb_episode: number of training iterations to run
results = []
episode_data = []

for n in range(N):
    result = rllib_trainer.train()   # one training iteration
    results.append(result)

    episode = {
        "n": n,
        "episode_reward_min": result["episode_reward_min"],
        "episode_reward_mean": result["episode_reward_mean"],
        "episode_reward_max": result["episode_reward_max"],
        "episode_len_mean": result["episode_len_mean"],
    }

    episode_data.append(episode)

    print(f'Max reward: {episode["episode_reward_max"]}')
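And, once the loop is done, I plan to save the continued training as a fresh checkpoint (my own addition, using the standard save() call if I got it right):

new_checkpoint = rllib_trainer.save()   # returns the path of the newly saved checkpoint
print(f"New checkpoint saved at: {new_checkpoint}")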