What is the default PPO network architecture?

Hi folks,

I’d appreciate it if you could help me find this information. I thought it would be in the documentation, but apparently not.

What is the default network architecture for RLlib’s PPO implementation? I’m using version 2.6.1 if that matters.

Thanks for your help,
Ram Rachum.

Hi, I’m using Tune, and when training starts it prints the following information with details about PPO and the environment:
Trial PPO_CacheEnv_17200_00000 started with configuration:
╭───────────────────────────────────────────────────────────────────────────╮
│ Trial PPO_CacheEnv_17200_00000 config │
├───────────────────────────────────────────────────────────────────────────┤
│ _AlgorithmConfig__prior_exploration_config │
│ _disable_action_flattening False │
│ _disable_execution_plan_api True │
│ _disable_initialize_loss_from_dummy_batch False │
│ _disable_preprocessor_api False │
│ _enable_new_api_stack False │
│ _fake_gpus False │
│ _is_atari │
│ _learner_class │
│ _rl_module_spec │
│ _tf_policy_handles_more_than_one_loss False │
│ action_mask_key action_mask │
│ action_space │
│ actions_in_input_normalized False │
│ always_attach_evaluation_results False │
│ auto_wrap_old_gym_envs True │
│ batch_mode truncate_episodes │
│ callbacks …efaultCallbacks'> │
│ checkpoint_trainable_policies_only False │
│ clip_actions False │
│ clip_param 0.3 │
│ clip_rewards │
│ compress_observations False │
│ count_steps_by env_steps │
│ create_env_on_driver False │
│ custom_eval_function │
│ delay_between_worker_restarts_s 60. │
│ disable_env_checking False │
│ eager_max_retraces 20 │
│ eager_tracing True │
│ enable_async_evaluation False │
│ enable_connectors True │
│ enable_tf1_exec_eagerly False │
│ entropy_coeff 0. │
│ entropy_coeff_schedule │
│ env …e_env4.CacheEnv'> │
│ env_config/C 10 │
│ env_config/cache_size 100 │
│ env_config/disable_env_checking True │
│ env_config/k1 7 │
│ env_config/k2 3 │
│ env_config/source_file …ipf/(20000,4).csv │
│ env_runner_cls │
│ env_task_fn │
│ evaluation_config │
│ evaluation_duration 10 │
│ evaluation_duration_unit episodes │
│ evaluation_interval │
│ evaluation_num_workers 0 │
│ evaluation_parallel_to_training False │
│ evaluation_sample_timeout_s 180. │
│ exploration_config/type StochasticSampling │
│ explore True │
│ export_native_model_files False │
│ fake_sampler False │
│ framework torch │
│ gamma 0.99 │
│ grad_clip │
│ grad_clip_by global_norm │
│ ignore_worker_failures False │
│ in_evaluation False │
│ input sampler │
│ keep_per_episode_custom_metrics False │
│ kl_coeff 0.2 │
│ kl_target 0.01 │
│ lambda 1. │
│ local_gpu_idx 0 │
│ local_tf_session_args/inter_op_parallelism_threads 8 │
│ local_tf_session_args/intra_op_parallelism_threads 8 │
│ log_level WARN │
│ log_sys_usage True │
│ logger_config │
│ logger_creator │
│ lr 0.01 │
│ lr_schedule │
│ max_num_worker_restarts 1000 │
│ max_requests_in_flight_per_sampler_worker 2 │
│ metrics_episode_collection_timeout_s 60. │
│ metrics_num_episodes_for_smoothing 100 │
│ min_sample_timesteps_per_iteration 0 │
│ min_time_s_per_iteration │
│ min_train_timesteps_per_iteration 0 │
│ model/_disable_action_flattening False │
│ model/_disable_preprocessor_api False │
│ model/_time_major False │
│ model/_use_default_native_models -1 │
│ model/always_check_shapes False │
│ model/attention_dim 64 │
│ model/attention_head_dim 32 │
│ model/attention_init_gru_gate_bias 2.0 │
│ model/attention_memory_inference 50 │
│ model/attention_memory_training 50 │
│ model/attention_num_heads 1 │
│ model/attention_num_transformer_units 1 │
│ model/attention_position_wise_mlp_dim 32 │
│ model/attention_use_n_prev_actions 0 │
│ model/attention_use_n_prev_rewards 0 │
│ model/conv_activation relu │
│ model/conv_filters │
│ model/custom_action_dist │
│ model/custom_model │
│ model/custom_preprocessor │
│ model/dim 84 │
│ model/encoder_latent_dim │
│ model/fcnet_activation tanh │
│ model/fcnet_hiddens [256, 256] │
│ model/framestack True │
│ model/free_log_std False │
│ model/grayscale False │
│ model/lstm_cell_size 256 │
│ model/lstm_use_prev_action False │
│ model/lstm_use_prev_action_reward -1 │
│ model/lstm_use_prev_reward False │
│ model/max_seq_len 20 │
│ model/no_final_linear False │
│ model/post_fcnet_activation relu │
│ model/post_fcnet_hiddens │
│ model/use_attention False │
│ model/use_lstm False │
│ model/vf_share_layers False │
│ model/zero_mean True │
│ normalize_actions True │
│ num_consecutive_worker_failures_tolerance 100 │
│ num_cpus_for_driver 2 │
│ num_cpus_per_learner_worker 1 │
│ num_cpus_per_worker 1 │
│ num_envs_per_worker 1 │
│ num_gpus 1 │
│ num_gpus_per_learner_worker 0 │
│ num_gpus_per_worker 0 │
│ num_learner_workers 0 │
│ num_sgd_iter 30 │
│ num_workers 2 │
│ observation_filter NoFilter │
│ observation_fn │
│ observation_space │
│ offline_sampling False │
│ ope_split_batch_by_episode True │
│ output │
│ output_compress_columns ['obs', 'new_obs'] │
│ output_max_file_size 67108864 │
│ placement_strategy PACK │
│ policies/default_policy …None, None, None) │
│ policies_to_train │
│ policy_map_cache -1 │
│ policy_map_capacity 100 │
│ policy_mapping_fn …t 0x7f400dbfb0d0> │
│ policy_states_are_swappable False │
│ postprocess_inputs False │
│ preprocessor_pref deepmind │
│ recreate_failed_workers False │
│ remote_env_batch_wait_ms 0 │
│ remote_worker_envs False │
│ render_env False │
│ replay_sequence_length │
│ restart_failed_sub_environments False │
│ rollout_fragment_length auto │
│ sample_async -1 │
│ sample_collector …leListCollector'> │
│ sampler_perf_stats_ema_coef │
│ seed │
│ sgd_minibatch_size 128 │
│ shuffle_buffer_size 0 │
│ shuffle_sequences True │
│ simple_optimizer -1 │
│ sync_filters_on_rollout_workers_timeout_s 60. │
│ synchronize_filters -1 │
│ tf_session_args/allow_soft_placement True │
│ tf_session_args/device_count/CPU 1 │
│ tf_session_args/gpu_options/allow_growth True │
│ tf_session_args/inter_op_parallelism_threads 2 │
│ tf_session_args/intra_op_parallelism_threads 2 │
│ tf_session_args/log_device_placement False │
│ torch_compile_learner False │
│ torch_compile_learner_dynamo_backend inductor │
│ torch_compile_learner_dynamo_mode │
│ torch_compile_learner_what_to_compile …ile.FORWARD_TRAIN │
│ torch_compile_worker False │
│ torch_compile_worker_dynamo_backend onnxrt │
│ torch_compile_worker_dynamo_mode │
│ train_batch_size 4000 │
│ update_worker_filter_stats True │
│ use_critic True │
│ use_gae True │
│ use_kl_loss True │
│ use_worker_filter_stats True │
│ validate_workers_after_construction True │
│ vf_clip_param 10. │
│ vf_loss_coeff 1. │
│ vf_share_layers -1 │
│ worker_cls -1 │
│ worker_health_probe_timeout_s 60 │
│ worker_restore_timeout_s 1800 │
╰───────────────────────────────────────────────────────────────────────────╯