RLlib not working with Tune when using sample batch input

I think I found a bug. It seems that when there are many different trials, the Ray Tuner object cannot find the correct metrics. If you look below, there are two sections where the “is” score is reported. When I ran this, I got an error saying that “is” could not be found when getting the best trial results:

RuntimeError: No best trial found for the given metric: is. This means that no trial has reported this metric, or all values reported for this metric are NaN. To not ignore NaN values, you can set the `filter_nan_and_inf` arg to False.

However, I can see a valid “is” score below in the “off_policy_estimator” section, while a second section called “off_policy_estimation” shows a blank “is” score. Is this a bug?

Result for DQN_None_02932_00000:
  agent_timesteps_total: 1419
  counters:
    last_target_update_ts: 1419
    num_agent_steps_sampled: 1419
    num_agent_steps_trained: 32
    num_env_steps_sampled: 1419
    num_env_steps_trained: 32
    num_target_updates: 1
  custom_metrics: {}
  date: 2022-10-03_15-23-43
  done: true
  episode_len_mean: .nan
  episode_media: {}
  episode_reward_max: .nan
  episode_reward_mean: .nan
  episode_reward_min: .nan
  episodes_this_iter: 0
  episodes_total: 0
  evaluation:
    custom_metrics: {}
    episode_len_mean: .nan
    episode_media: {}
    episode_reward_max: .nan
    episode_reward_mean: .nan
    episode_reward_min: .nan
    episodes_this_iter: 0
    hist_stats:
      episode_lengths: []
      episode_reward: []
    num_agent_steps_sampled_this_iter: 14190
    num_env_steps_sampled_this_iter: 14190
    num_faulty_episodes: 0
    num_healthy_workers: 0
    num_recreated_workers: 0
    off_policy_estimator:
      dm_fqe:
        v_behavior: -2.178756733166039
        v_behavior_std: 2.825201328784723
        v_delta: 2.0736164829898476
        v_gain: -10514025.017619133
        v_target: -0.10514024645090103
        v_target_std: 0.07492697238922119
      dr_fqe:
        v_behavior: -2.178756733166039
        v_behavior_std: 2.825201328784723
        v_delta: 1.9467460657501043
        v_gain: -23201066.741593476
        v_target: -0.2320106674159348
        v_target_std: 0.2093317772366224
      is:
        v_behavior: -2.178756733166039
        v_behavior_std: 2.825201328784723
        v_delta: 2.0171608490363973
        v_gain: -16159588.412964145
        v_target: -0.16159588412964143
        v_target_std: 0.20928801488565427
      wis:
        v_behavior: -2.178756733166039
        v_behavior_std: 2.825201328784723
        v_delta: 0.004282480545038237
        v_gain: -217447425.26210007
        v_target: -2.1744742526210006
        v_target_std: 2.832575122650508
    policy_reward_max: {}
    policy_reward_mean: {}
    policy_reward_min: {}
    sampler_perf: {}
    timesteps_this_iter: 14190
  experiment_id: bc2f86e55c5e4cc5a83f256689a3d9d5
  hostname: LAMU02DG1AYMD6R.uhc.com.
  info:
    last_target_update_ts: 1419
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          allreduce_latency: 0.0
          cur_lr: 0.03775
          grad_gnorm: 40.0
          max_q: 406.4980163574219
          mean_q: 196.5668487548828
          min_q: -2.9597431421279907e-06
        mean_td_error: 198.6293487548828
        model: {}
        num_agent_steps_trained: 32.0
        off_policy_estimation:
          dm_fqe:
            loss: 11.95052923605517
          dr_fqe:
            loss: 12.19793702485873
          is: {}
          wis: {}
        td_error: [1.003274917602539, 339.66650390625, 375.677001953125, 5.004027843475342,
          327.1493835449219, 324.4587097167969, 366.9879150390625, -1.996725082397461,
          2.003274917602539, 6.004586219787598, 316.7760314941406, 322.8465270996094,
          354.6396789550781, 405.4980163574219, 6.075279712677002, 327.3210754394531,
          4.075279712677002, 4.999997138977051, 1.0045862197875977, 209.001220703125,
          328.0361633300781, 392.8001403808594, 325.0860900878906, 0.0005772840231657028,
          -1.9994226694107056, 330.5352783203125, 329.55267333984375, 1.0376253128051758,
          2.003274917602539, 248.79647827148438, 350.1905212402344, 351.9041748046875]
    num_agent_steps_sampled: 1419
    num_agent_steps_trained: 32
    num_env_steps_sampled: 1419
    num_env_steps_trained: 32
    num_target_updates: 1
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  num_agent_steps_sampled: 1419
  num_agent_steps_trained: 32
  num_env_steps_sampled: 1419
  num_env_steps_sampled_this_iter: 1419
  num_env_steps_trained: 32
  num_env_steps_trained_this_iter: 32
  num_faulty_episodes: 0
  num_healthy_workers: 0
  num_recreated_workers: 0
  num_steps_trained_this_iter: 32
  perf:
    cpu_util_percent: 35.375
    ram_util_percent: 57.77094594594595
  pid: 200
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf: {}
  sampler_results:
    custom_metrics: {}
    episode_len_mean: .nan
    episode_media: {}
    episode_reward_max: .nan
    episode_reward_mean: .nan
    episode_reward_min: .nan
    episodes_this_iter: 0
    hist_stats:
      episode_lengths: []
      episode_reward: []
    num_faulty_episodes: 0
    policy_reward_max: {}
    policy_reward_mean: {}
    policy_reward_min: {}
    sampler_perf: {}
  time_since_restore: 104.2542130947113
  time_this_iter_s: 104.2542130947113
  time_total_s: 104.2542130947113
  timers:
    learn_throughput: 2108.187
    learn_time_ms: 15.179
    load_throughput: 169253.125
    load_time_ms: 0.189
    synch_weights_time_ms: 0.024
    training_iteration_time_ms: 254.77
  timestamp: 1664828623
  timesteps_since_restore: 0
  timesteps_total: 1419
  training_iteration: 1
  trial_id: 02932_00000
  warmup_time: 0.13548803329467773

The only thing I changed between this code and the code above is the parameter search space.

# Imports assumed for this snippet (Ray 2.x RLlib); DEBUG, ENV_CONFIG and
# DQN_TRAIN_CONFIG are defined elsewhere in my project.
from ray.rllib.algorithms.dqn import DQNConfig
from ray.rllib.offline.estimators import (
    DirectMethod,
    DoublyRobust,
    ImportanceSampling,
    WeightedImportanceSampling,
)
from ray.rllib.offline.estimators.fqe_torch_model import FQETorchModel

config = (
    DQNConfig()
    .resources(num_gpus=0 if DEBUG else 1)
    .debugging(seed=42 if DEBUG else None)
    .environment(**ENV_CONFIG)
    .training(**DQN_TRAIN_CONFIG)
    .framework("torch")
    .offline_data(input_='/Users/jweinbe3/PycharmProjects/optumrx-advanced-analytics-personalization/agent/agent_model/sample_batches/train_data')
    .evaluation(
        evaluation_interval=1,
        evaluation_duration=10,
        evaluation_num_workers=0,
        evaluation_parallel_to_training=True,
        evaluation_duration_unit="episodes",
        evaluation_config={"input": '/Users/jweinbe3/PycharmProjects/optumrx-advanced-analytics-personalization/agent/agent_model/sample_batches/train_data'},
        off_policy_estimation_methods={
            "is": {"type": ImportanceSampling},
            "wis": {"type": WeightedImportanceSampling},
            "dm_fqe": {
                "type": DirectMethod,
                "q_model_config": {"type": FQETorchModel, "polyak_coef": 0.05},
            },
            "dr_fqe": {
                "type": DoublyRobust,
                "q_model_config": {"type": FQETorchModel, "polyak_coef": 0.05},
            },
        },
    )
)

Hey @Jason_Weinberg ,

I don’t know if this happens outside of the snippet you posted, but you need to tell ray.tune which metric to optimize for! Otherwise it will default to the mean reward, which is not present in your case!
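
For example, here is a minimal sketch (assuming the Ray 2.x AIR API) of handing Tune an explicit metric up front; "your_metric_key" is just a placeholder:

from ray import air, tune
from ray.rllib.algorithms.dqn import DQN

# Sketch only: tell Tune which result key to optimize instead of the default
# mean reward. Replace "your_metric_key" with the key you actually want
# (see the path discussion further down in this thread).
tuner = tune.Tuner(
    DQN,
    param_space=config.to_dict(),  # the DQNConfig from the snippet above
    tune_config=tune.TuneConfig(metric="your_metric_key", mode="max"),
    run_config=air.RunConfig(stop={"training_iteration": 1}),
)
results = tuner.fit()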

Cheers

Oh yeah, that is in there; you can see in the error message that I am specifying the “is” metric.

# Imports assumed for this snippet; CustomMLflowLogger, mlflow_url and
# experiment_name are defined elsewhere in my project.
from ray import air
from ray.tune import Tuner
from ray.rllib.algorithms.dqn import DQN

mflow = CustomMLflowLogger(tracking_uri=mlflow_url, experiment_name=experiment_name)
stop = {"training_iteration": 1}


t = Tuner(DQN,
          param_space=config.to_dict(),
          run_config=air.RunConfig(stop=stop, callbacks=[mflow]))
results = t.fit()
# TODO look into why metric cannot be found but I can see it in "off_policy_estimator"
best_model = results.get_best_result(metric='is', mode='max')

Tune does not scan through the whole results dictionary to find a metric, because that would risk name collisions.
You have to specify the full path into the results dictionary where Tune can find the metric you want.
The metric you want is at “evaluation/off_policy_estimator/is”, not at “is”!
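
To make the path syntax concrete, here is a small illustration (not Tune’s internal code) of how the nested result dict from your printout maps to “/”-separated keys; the values are copied from your output above:

# Illustration only: looking up "/"-separated paths in a nested result dict.
result = {
    "evaluation": {
        "off_policy_estimator": {
            "is": {
                "v_behavior": -2.178756733166039,
                "v_gain": -16159588.412964145,
            },
        },
    },
}

def lookup(result: dict, path: str):
    """Walk a nested dict with an 'a/b/c' style path."""
    node = result
    for key in path.split("/"):
        node = node[key]
    return node

print(lookup(result, "evaluation/off_policy_estimator/is"))
# -> a dict of sub-metrics, not a single number
print(lookup(result, "evaluation/off_policy_estimator/is/v_behavior"))
# -> -2.178756733166039, a scalar Tune can compare trials on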

Ahh ok, I didn’t know you could use paths for those metrics, interesting syntax. Thank you!

This ended up working. My model is training great now, thank you!

best_model = results.get_best_result(metric='evaluation/off_policy_estimator/is/v_behavior', mode='max')

I will say that this is not documented very well; you have to do a lot of trial and error just to figure this out.
