Correct usage of tune sampling in AlgorithmConfig dicts

I know that a similar topic came up in this old thread, but there was no answer communicated.

Basically, the same thing is happening to me now. I would like to use the new AlgorithmConfig API and set two parameters in the .training() section via Tune sampling. Note that I am using framework “tf” on purpose, as “tf2” always leads to memory crashes on my workstation (even with eager_tracing=False).

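trainer_config = (
    PPOConfig()  # assumption: PPO, inferred from the PPO-specific keys in the dump below
    # ... (environment, framework, and other settings not shown)
    .checkpointing(export_native_model_files=True, checkpoint_trainable_policies_only=True)
    .training(
        model={"custom_model": MyTFModelV2},
        gamma=tune.grid_search([0.9, 0.99, 0.999]),
        lr=tune.loguniform(1e-4, 1e-1),
    )
)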

Sorry for the wall of text, but since @kai asked for the trainer_config in the thread linked above, I am posting it below.
→ Could the issue be that tune.loguniform() returns a Float sampler object rather than a plain number? If so, how can I work around it?

{ ‘_disable_action_flattening’: False,
‘_disable_execution_plan_api’: True,
‘_disable_preprocessor_api’: False,
‘_enable_rl_module_api’: False,
‘_enable_rl_trainer_api’: False,
‘_fake_gpus’: False,
‘_rl_trainer_hps’: RLTrainerHPs(),
‘_tf_policy_handles_more_than_one_loss’: False,
‘action_space’: None,
‘actions_in_input_normalized’: False,
‘always_attach_evaluation_results’: False,
‘auto_wrap_old_gym_envs’: True,
‘batch_mode’: ‘truncate_episodes’,
‘callbacks’: <class ‘rl_chem_pps.callbacks.MyCallback’>,
‘checkpoint_trainable_policies_only’: True,
‘clip_actions’: False,
‘clip_param’: 0.3,
‘clip_rewards’: None,
‘compress_observations’: False,
‘create_env_on_driver’: False,
‘custom_eval_function’: None,
‘custom_resources_per_worker’: { },
‘disable_env_checking’: False,
‘eager_max_retraces’: 20,
‘eager_tracing’: False,
‘enable_async_evaluation’: False,
‘enable_connectors’: True,
‘enable_tf1_exec_eagerly’: False,
‘entropy_coeff’: 0.0,
‘entropy_coeff_schedule’: None,
‘env’: <class ‘rl_chem_pps.MyCustomEnv’>,
‘env_task_fn’: None,
‘evaluation_config’: None,
‘evaluation_duration’: 10,
‘evaluation_duration_unit’: ‘episodes’,
‘evaluation_interval’: None,
‘evaluation_num_workers’: 0,
‘evaluation_parallel_to_training’: False,
‘evaluation_sample_timeout_s’: 180.0,
‘exploration_config’: { ‘type’: ‘StochasticSampling’},
‘explore’: True,
‘export_native_model_files’: True,
‘extra_python_environs_for_driver’: { },
‘extra_python_environs_for_worker’: { },
‘fake_sampler’: False,
‘framework’: ‘tf’,
‘gamma’: { ‘grid_search’: [0.9, 0.99, 0.999]},
‘grad_clip’: None,
‘horizon’: -1,
‘ignore_worker_failures’: False,
‘in_evaluation’: False,
‘input’: ‘sampler’,
‘input_config’: { },
‘is_atari’: None,
‘keep_per_episode_custom_metrics’: False,
‘kl_coeff’: 0.2,
‘kl_target’: 0.01,
‘lambda’: 1.0,
‘local_tf_session_args’: { ‘inter_op_parallelism_threads’: 8,
‘intra_op_parallelism_threads’: 8},
‘log_level’: ‘INFO’,
‘log_sys_usage’: True,
‘logger_config’: None,
‘logger_creator’: None,
‘lr’: <ray.tune.search.sample.Float object at 0x000002454A7AC460>,
‘lr_schedule’: None,
‘max_requests_in_flight_per_sampler_worker’: 2,
‘metrics_episode_collection_timeout_s’: 60.0,
‘metrics_num_episodes_for_smoothing’: 100,
‘min_sample_timesteps_per_iteration’: 0,
‘min_time_s_per_iteration’: None,
‘min_train_timesteps_per_iteration’: 0,
‘model’: { ‘_disable_action_flattening’: False,
‘_disable_preprocessor_api’: False,
‘_time_major’: False,
‘_use_default_native_models’: -1,
‘attention_dim’: 64,
‘attention_head_dim’: 32,
‘attention_init_gru_gate_bias’: 2.0,
‘attention_memory_inference’: 50,
‘attention_memory_training’: 50,
‘attention_num_heads’: 1,
‘attention_num_transformer_units’: 1,
‘attention_position_wise_mlp_dim’: 32,
‘attention_use_n_prev_actions’: 0,
‘attention_use_n_prev_rewards’: 0,
‘conv_activation’: ‘relu’,
‘conv_filters’: None,
‘custom_action_dist’: None,
‘custom_model’: <class ‘rl_chem_pps.models.WkActionMaskModel’>,
‘custom_model_config’: { },
‘custom_preprocessor’: None,
‘dim’: 84,
‘fcnet_activation’: ‘tanh’,
‘fcnet_hiddens’: [ 256, ...],
‘framestack’: True,
‘free_log_std’: False,
‘grayscale’: False,
‘lstm_cell_size’: 256,
‘lstm_use_prev_action’: False,
‘lstm_use_prev_action_reward’: -1,
‘lstm_use_prev_reward’: False,
‘max_seq_len’: 20,
‘no_final_linear’: False,
‘post_fcnet_activation’: ‘relu’,
‘post_fcnet_hiddens’: [ ],
‘use_attention’: False,
‘use_lstm’: False,
‘vf_share_layers’: False,
‘zero_mean’: True},
‘multiagent’: { ‘count_steps_by’: ‘env_steps’,
‘observation_fn’: None,
‘policies’: { ‘default_policy’: ( None, ...)},
‘policies_to_train’: None,
‘policy_map_cache’: -1,
‘policy_map_capacity’: 100,
‘policy_mapping_fn’: <function AlgorithmConfig.__init__.<locals>.<lambda> at 0x000002454A792940>},
‘no_done_at_end’: -1,
‘normalize_actions’: True,
‘num_consecutive_worker_failures_tolerance’: 100,
‘num_cpus_for_driver’: 1,
‘num_cpus_per_trainer_worker’: 1,
‘num_cpus_per_worker’: 1,
‘num_envs_per_worker’: 1,
‘num_gpus’: 0,
‘num_gpus_per_trainer_worker’: 0,
‘num_gpus_per_worker’: 0,
‘num_sgd_iter’: 2,
‘num_trainer_workers’: 0,
‘num_workers’: 1,
‘observation_filter’: ‘NoFilter’,
‘observation_space’: None,
‘off_policy_estimation_methods’: { },
‘offline_sampling’: False,
‘ope_split_batch_by_episode’: True,
‘optimizer’: { },
‘output’: None,
‘output_compress_columns’: [ ‘obs’, ...],
‘output_config’: { },
‘output_max_file_size’: 67108864,
‘placement_strategy’: ‘PACK’,
‘policies’: { ‘default_policy’: <ray.rllib.policy.policy.PolicySpec object at 0x000002454A7AC5E0>},
‘policy_states_are_swappable’: False,
‘postprocess_inputs’: False,
‘preprocessor_pref’: ‘deepmind’,
‘recreate_failed_workers’: False,
‘remote_env_batch_wait_ms’: 0,
‘remote_worker_envs’: False,
‘render_env’: False,
‘replay_sequence_length’: None,
‘restart_failed_sub_environments’: False,
‘rl_module_class’: None,
‘rl_trainer_class’: None,
‘rollout_fragment_length’: ‘auto’,
‘sample_async’: False,
‘sample_collector’: <class ‘ray.rllib.evaluation.collectors.simple_list_collector.SimpleListCollector’>,
‘sampler_perf_stats_ema_coef’: None,
‘seed’: None,
‘sgd_minibatch_size’: 5,
‘shuffle_buffer_size’: 0,
‘shuffle_sequences’: True,
‘simple_optimizer’: -1,
‘soft_horizon’: -1,
‘sync_filters_on_rollout_workers_timeout_s’: 60.0,
‘synchronize_filters’: True,
‘tf_session_args’: { ‘allow_soft_placement’: True,
‘device_count’: { ‘CPU’: 1},
‘gpu_options’: { ‘allow_growth’: True},
‘inter_op_parallelism_threads’: 2,
‘intra_op_parallelism_threads’: 2,
‘log_device_placement’: False},
‘train_batch_size’: 10,
‘use_critic’: True,
‘use_gae’: True,
‘validate_workers_after_construction’: True,
‘vf_clip_param’: 10.0,
‘vf_loss_coeff’: 1.0,
‘vf_share_layers’: -1,
‘worker_cls’: None,
‘worker_health_probe_timeout_s’: 60,
‘worker_restore_timeout_s’: 1800}

To make sure that Tune properly detects each search dimension (for example gamma=tune.grid_search([0.9, 0.99, 0.999])), convert the AlgorithmConfig to a dict before passing it to Tune.

Tune is not fully compatible with the AlgorithmConfig API, so the workflow is as follows. Assemble an AlgorithmConfig and give it the search spaces, as you did above. Then make sure these search spaces have no implications for other parameters: if you searched over “_enable_rl_module_api”, for example, enabling or disabling it could affect fields other than “_enable_rl_module_api” itself. Very few parameters behave this way; hyperparameters such as gamma or num_sgd_iter do not. Once you have confirmed this, convert the config into a dict with to_dict() and check that the resulting dict still contains every search space you defined. Please let us know if anything unexpected happens in this step. Finally, pass the dict to Tune; RLlib will convert it back into an AlgorithmConfig object under the hood.
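The "check that the dict still contains every search space" step can be sketched in plain Python. This is only an illustration, not part of RLlib or Tune: iter_search_spaces and FakeLogUniform are hypothetical names, and the stand-in class exists only so the sketch runs without Ray installed. It assumes that tune.grid_search produces a dict of the form {"grid_search": [...]} and that sampler objects such as the one returned by tune.loguniform expose a sample() method.

```python
import math
import random
from typing import Any, Iterator, Tuple


def iter_search_spaces(node: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, value) for entries that look like Tune search spaces.

    Heuristic only: it recognises {"grid_search": [...]} dicts and any
    non-class object with a callable .sample attribute. Other objects that
    happen to have .sample() (e.g. a gym space) would also match, and lists
    are not descended into.
    """
    if isinstance(node, dict):
        if set(node) == {"grid_search"}:
            yield path, node
            return
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            yield from iter_search_spaces(value, child)
    elif not isinstance(node, type) and callable(getattr(node, "sample", None)):
        yield path, node


class FakeLogUniform:
    """Hypothetical stand-in for the object returned by tune.loguniform."""

    def __init__(self, lower: float, upper: float) -> None:
        self.lower, self.upper = lower, upper

    def sample(self) -> float:
        # Draw uniformly in log space between the two bounds.
        return math.exp(random.uniform(math.log(self.lower), math.log(self.upper)))


# Stand-in for the result of config.to_dict(); only a few keys are shown.
config_dict = {
    "framework": "tf",
    "gamma": {"grid_search": [0.9, 0.99, 0.999]},
    "lr": FakeLogUniform(1e-4, 1e-1),
    "model": {"fcnet_activation": "tanh"},
}

found = dict(iter_search_spaces(config_dict))
missing = {"gamma", "lr"} - set(found)  # search spaces we expected but did not find

print(sorted(found))  # ['gamma', 'lr']
print(missing)        # set()
```

With Ray installed, config_dict would instead be the dict returned by to_dict() on the real config, and an empty missing set would indicate that both search spaces survived the conversion before the dict is handed to Tune.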