I think I found a bug. It seems that when there are many different trials, the Ray Tuner object cannot find the correct metrics. If you look at the result below, there are two sections where the “is” score is reported. When I ran this and tried to fetch the best trial results, I got an error saying that “is” could not be found:
RuntimeError: No best trial found for the given metric: is. This means that no trial has reported this metric, or all values reported for this metric are NaN. To not ignore NaN values, you can set the `filter_nan_and_inf` arg to False.
However, I can see a valid “is” score below in the “off_policy_estimator” section under “evaluation”. There is also a second section called “off_policy_estimation” (under info/learner/default_policy) where the “is” entry is empty. Is this a bug?
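For context, this is roughly how the Tuner is created and queried (a simplified sketch rather than my exact code; the stopping criterion and num_samples are placeholders):

```python
from ray import air, tune

# Simplified sketch: "is" is the metric passed to Tune, which is what triggers
# the RuntimeError when fetching the best result.
tuner = tune.Tuner(
    "DQN",
    param_space=config.to_dict(),
    tune_config=tune.TuneConfig(metric="is", mode="max", num_samples=1),
    run_config=air.RunConfig(stop={"training_iteration": 1}),
)
results = tuner.fit()
best_result = results.get_best_result()  # raises RuntimeError: No best trial found for the given metric: is
```

Part of my confusion is that “is” shows up in the result as a nested dict (e.g. evaluation/off_policy_estimator/is/v_gain) rather than a single scalar, so maybe the metric string needs to be a fully qualified nested key instead?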
Result for DQN_None_02932_00000:
  agent_timesteps_total: 1419
  counters:
    last_target_update_ts: 1419
    num_agent_steps_sampled: 1419
    num_agent_steps_trained: 32
    num_env_steps_sampled: 1419
    num_env_steps_trained: 32
    num_target_updates: 1
  custom_metrics: {}
  date: 2022-10-03_15-23-43
  done: true
  episode_len_mean: .nan
  episode_media: {}
  episode_reward_max: .nan
  episode_reward_mean: .nan
  episode_reward_min: .nan
  episodes_this_iter: 0
  episodes_total: 0
  evaluation:
    custom_metrics: {}
    episode_len_mean: .nan
    episode_media: {}
    episode_reward_max: .nan
    episode_reward_mean: .nan
    episode_reward_min: .nan
    episodes_this_iter: 0
    hist_stats:
      episode_lengths: []
      episode_reward: []
    num_agent_steps_sampled_this_iter: 14190
    num_env_steps_sampled_this_iter: 14190
    num_faulty_episodes: 0
    num_healthy_workers: 0
    num_recreated_workers: 0
    off_policy_estimator:
      dm_fqe:
        v_behavior: -2.178756733166039
        v_behavior_std: 2.825201328784723
        v_delta: 2.0736164829898476
        v_gain: -10514025.017619133
        v_target: -0.10514024645090103
        v_target_std: 0.07492697238922119
      dr_fqe:
        v_behavior: -2.178756733166039
        v_behavior_std: 2.825201328784723
        v_delta: 1.9467460657501043
        v_gain: -23201066.741593476
        v_target: -0.2320106674159348
        v_target_std: 0.2093317772366224
      is:
        v_behavior: -2.178756733166039
        v_behavior_std: 2.825201328784723
        v_delta: 2.0171608490363973
        v_gain: -16159588.412964145
        v_target: -0.16159588412964143
        v_target_std: 0.20928801488565427
      wis:
        v_behavior: -2.178756733166039
        v_behavior_std: 2.825201328784723
        v_delta: 0.004282480545038237
        v_gain: -217447425.26210007
        v_target: -2.1744742526210006
        v_target_std: 2.832575122650508
    policy_reward_max: {}
    policy_reward_mean: {}
    policy_reward_min: {}
    sampler_perf: {}
    timesteps_this_iter: 14190
  experiment_id: bc2f86e55c5e4cc5a83f256689a3d9d5
  hostname: LAMU02DG1AYMD6R.uhc.com.
  info:
    last_target_update_ts: 1419
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_lr: 0.03775
          grad_gnorm: 40.0
          max_q: 406.4980163574219
          mean_q: 196.5668487548828
          min_q: -2.9597431421279907e-06
        mean_td_error: 198.6293487548828
        model: {}
        num_agent_steps_trained: 32.0
        off_policy_estimation:
          dm_fqe:
            loss: 11.95052923605517
          dr_fqe:
            loss: 12.19793702485873
          is: {}
          wis: {}
        td_error: [1.003274917602539, 339.66650390625, 375.677001953125, 5.004027843475342,
          327.1493835449219, 324.4587097167969, 366.9879150390625, -1.996725082397461,
          2.003274917602539, 6.004586219787598, 316.7760314941406, 322.8465270996094,
          354.6396789550781, 405.4980163574219, 6.075279712677002, 327.3210754394531,
          4.075279712677002, 4.999997138977051, 1.0045862197875977, 209.001220703125,
          328.0361633300781, 392.8001403808594, 325.0860900878906, 0.0005772840231657028,
          -1.9994226694107056, 330.5352783203125, 329.55267333984375, 1.0376253128051758,
          2.003274917602539, 248.79647827148438, 350.1905212402344, 351.9041748046875]
    num_agent_steps_sampled: 1419
    num_agent_steps_trained: 32
    num_env_steps_sampled: 1419
    num_env_steps_trained: 32
    num_target_updates: 1
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  num_agent_steps_sampled: 1419
  num_agent_steps_trained: 32
  num_env_steps_sampled: 1419
  num_env_steps_sampled_this_iter: 1419
  num_env_steps_trained: 32
  num_env_steps_trained_this_iter: 32
  num_faulty_episodes: 0
  num_healthy_workers: 0
  num_recreated_workers: 0
  num_steps_trained_this_iter: 32
  perf:
    cpu_util_percent: 35.375
    ram_util_percent: 57.77094594594595
  pid: 200
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf: {}
  sampler_results:
    custom_metrics: {}
    episode_len_mean: .nan
    episode_media: {}
    episode_reward_max: .nan
    episode_reward_mean: .nan
    episode_reward_min: .nan
    episodes_this_iter: 0
    hist_stats:
      episode_lengths: []
      episode_reward: []
    num_faulty_episodes: 0
    policy_reward_max: {}
    policy_reward_mean: {}
    policy_reward_min: {}
    sampler_perf: {}
  time_since_restore: 104.2542130947113
  time_this_iter_s: 104.2542130947113
  time_total_s: 104.2542130947113
  timers:
    learn_throughput: 2108.187
    learn_time_ms: 15.179
    load_throughput: 169253.125
    load_time_ms: 0.189
    synch_weights_time_ms: 0.024
    training_iteration_time_ms: 254.77
  timestamp: 1664828623
  timesteps_since_restore: 0
  timesteps_total: 1419
  training_iteration: 1
  trial_id: 02932_00000
  warmup_time: 0.13548803329467773
The only thing I changed between this code and the code above is the addition of a parameter search space (a rough sketch of what that looks like follows the config).
from ray.rllib.algorithms.dqn import DQNConfig
from ray.rllib.offline.estimators import (
    DirectMethod,
    DoublyRobust,
    ImportanceSampling,
    WeightedImportanceSampling,
)
from ray.rllib.offline.estimators.fqe_torch_model import FQETorchModel

# DEBUG, ENV_CONFIG, and DQN_TRAIN_CONFIG are defined elsewhere.
config = (
    DQNConfig()
    .resources(num_gpus=0 if DEBUG else 1)
    .debugging(seed=42 if DEBUG else None)
    .environment(**ENV_CONFIG)
    .training(**DQN_TRAIN_CONFIG)  # see the DQN_TRAIN_CONFIG sketch below
    .framework("torch")
    .offline_data(input_='/Users/jweinbe3/PycharmProjects/optumrx-advanced-analytics-personalization/agent/agent_model/sample_batches/train_data')
    .evaluation(
        evaluation_interval=1,
        evaluation_duration=10,
        evaluation_num_workers=0,
        evaluation_parallel_to_training=True,
        evaluation_duration_unit="episodes",
        evaluation_config={"input": '/Users/jweinbe3/PycharmProjects/optumrx-advanced-analytics-personalization/agent/agent_model/sample_batches/train_data'},
        off_policy_estimation_methods={
            "is": {"type": ImportanceSampling},
            "wis": {"type": WeightedImportanceSampling},
            "dm_fqe": {
                "type": DirectMethod,
                "q_model_config": {"type": FQETorchModel, "polyak_coef": 0.05},
            },
            "dr_fqe": {
                "type": DoublyRobust,
                "q_model_config": {"type": FQETorchModel, "polyak_coef": 0.05},
            },
        },
    )
)
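The search space itself goes in through DQN_TRAIN_CONFIG. Purely as an illustration of its shape (the real hyperparameters and ranges are different), it is a plain dict of .training() kwargs in which some values are tune search spaces:

```python
from ray import tune

# Illustrative stand-in for DQN_TRAIN_CONFIG; the actual keys and ranges differ.
# Any tune.* value here turns the corresponding .training() argument into part
# of the Tuner's parameter search space.
DQN_TRAIN_CONFIG = {
    "gamma": 0.99,
    "lr": tune.loguniform(1e-4, 1e-1),
    "train_batch_size": tune.choice([32, 64, 128]),
}
```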