[Tune PBT] Population Based Training :: Questions & Errors

(The questions I’m posting here will probably be fairly simple to answer for any experienced user. I’m still in the beginning stages with Ray and am just playing around with many of its utilities, so any help would be appreciated!)

Hi,

I’ve run a PBT experiment on PPO with my custom simulator.
Among the 4 trials I ran, only 1 survived after six hours, as you can see from the figure below. I need help with understanding why the errors occurred. I have included the link to the error logs below. If someone can explain the reasons, it will be really helpful.

Furthermore, there are some things I do not understand about Ray’s implementation of PBT.

Below is part of my code that is concerned with PBT. As you can see, I’m trying to optimize these six hyper-parameters: lambda, clip_param, lr, num_sgd_iter, sgd_minibatch_size, train_batch_size.
The questions I have are:

  1. I used tune.qrandint(128, 1024, 128), hoping that the candidates in the search space would be rounded to integer increments of 128, as the Tune API states. But in the pbt_global.txt file, I found values such as 307 and 153. How is this possible?
  2. Can someone help me understand how to interpret the pbt_global.txt file? I’m very lost with it. Here is the link to pbt_global.txt. I mainly want to know what to look at to figure out where each trial changed its hyper-parameters through exploration and exploitation. This will be fairly simple for any experienced user, I assume.
import random

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

# Mutate six PPO hyper-parameters every 50 training iterations,
# keeping the trials with the highest episode_reward_max.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=50,
    metric="episode_reward_max",
    mode="max",
    hyperparam_mutations={
        "lambda": lambda: random.uniform(0.95, 1.0),
        "clip_param": lambda: random.uniform(0.01, 0.5),
        "lr": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
        "num_sgd_iter": lambda: random.randint(1, 30),
        "sgd_minibatch_size": tune.qrandint(128, 1024, 128),
        "train_batch_size": tune.qrandint(2_500, 7_500, 2_500),
    },
)

results = tune.run(
    "PPO",
    name="PBT_PPO",
    config=config,
    checkpoint_freq=1,
    stop={"time_total_s": 43_200},  # stop each trial after 12 hours
    checkpoint_score_attr="episode_reward_max",
    scheduler=pbt,
    num_samples=4,
    local_dir=args.save_dir,
)

Thank you!

Hi @Kai_Yun,

For tune.qrandint: this sampler is only used for the initial sampling of hyperparameter values. In Population Based Training, hyperparameters are mutated when a trial exploits another trial, and per the original paper this means the parameter values are multiplied by 0.8 or 1.2. Hence the 153 - this is just 128 * 1.2 = 153.6, rounded down to 153 (and 256 * 1.2 = 307.2 ≈ 307).
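To make the arithmetic concrete, here is a simplified sketch of that perturbation rule (the mutate function is purely illustrative, not Ray's actual code; Ray additionally resamples a fresh value with a configurable resample_probability):

import random

def mutate(value, is_int=True):
    # Simplified sketch: scale the current value by 0.8 or 1.2 and,
    # for integer hyper-parameters, truncate back to an int.
    new_value = value * random.choice([0.8, 1.2])
    return int(new_value) if is_int else new_value

# 128 * 1.2 = 153.6 -> 153,  256 * 1.2 = 307.2 -> 307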

There are a couple of things you could do to change this behavior. The most straightforward one is to just pass tune.choice instead - e.g. "sgd_minibatch_size": tune.choice(list(range(128, 1025, 128))) - then the mutations will only use values from that list of categories.
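Applied to the mutation dict above, that would look roughly like this (only the two batch-size entries change; [2_500, 5_000, 7_500] is just my reading of the original qrandint(2_500, 7_500, 2_500) grid):

hyperparam_mutations={
    # ... other entries as before ...
    # With categorical choices, mutations can only pick one of these exact
    # values, so off-grid numbers like 153 or 307 no longer show up.
    "sgd_minibatch_size": tune.choice(list(range(128, 1025, 128))),
    "train_batch_size": tune.choice([2_500, 5_000, 7_500]),
}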

Instead of looking at the hyperparameter globals file, I’d suggest looking at the files of the individual trials. Each row generally looks like this: old_tag, new_tag, old_step, new_step, old_conf, new_conf. So each time a trial exploits another trial, you get its old tag, its new tag, the step the original trial was at when it exploited the other trial, the step the other trial had (and that the trial will henceforth have as well), the old configuration, and the new (mutated) configuration.
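If it helps, you can also walk through those files with a few lines of Python (a sketch assuming each row is a JSON-encoded list in exactly the order above; the file name is just an example):

import json

# Print one summary line per exploit step recorded for a trial, assuming
# each row is [old_tag, new_tag, old_step, new_step, old_conf, new_conf].
with open("pbt_policy_00000.txt") as f:
    for line in f:
        old_tag, new_tag, old_step, new_step, old_conf, new_conf = json.loads(line)
        print(f"{old_tag} (step {old_step}) exploited {new_tag} (step {new_step})")
        for key, new_val in new_conf.items():
            if old_conf.get(key) != new_val:
                print(f"  {key}: {old_conf.get(key)} -> {new_val}")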

For the TensorFlow errors, this might be related to your search space definition, and it’s hard to tell without looking at your custom simulator. cc @sven1977 in case you’ve seen something like this before.


Thank you @kai !

I have some other follow-up questions. I made a summary of all my pbt_policy_<trial-number>.txt files based on your answer as you can see below.
The questions I have are:

  1. Why does the first checkpoint of policy_00000 have Tag 2 as its “old_tag”? Shouldn’t it be Tag 0, since this is its first checkpoint? If you can explain how this tagging works, that would be helpful.
  2. The first three checkpoints of all policies have the exact same values. Does this simply mean that they are all taking on the hyper-parameters of the same tag, since trials in PBT take the best config among the trials?
  3. The pbt_global.txt file has 9 rows. My PBT experiment had 15 checkpoints and 9 perturbations in total. As you can see in the figure below, the total number of checkpoints from all individual pbt files is 15. What should I make of this?

Thanks for the kind answers again.

Hi,

It is a bit tricky to parse the policy file. You can also take a look at the code here to see how we parse it for replay: ray/pbt.py at master · ray-project/ray · GitHub
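For completeness, replaying a recorded schedule looks roughly like this (the path and stopping criterion are placeholders; see the linked pbt.py or the Tune docs for the exact behavior in your Ray version):

from ray import tune
from ray.tune.schedulers import PopulationBasedTrainingReplay

# Replays the schedule recorded for a single trial: the scheduler reads the
# pbt_policy_<trial>.txt file and re-applies every recorded config change
# at the training iteration where it originally happened.
replay = PopulationBasedTrainingReplay("/path/to/pbt_policy_00000.txt")

tune.run(
    "PPO",
    scheduler=replay,
    stop={"training_iteration": 100},  # placeholder stopping criterion
)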

  1. When a trial exploits another trial, it also copies the exploitation history. As you can see in your PBT global log, trial 2 exploits trial 1 first (second row). Then trial 0 exploits trial 2, so the history here is 1 -> 2 -> 0. In your case, trial 0 still performs badly and later exploits trial 3. However, trial 3 itself never exploited any other trial before that point. Thus, if you were to replay that trial, you would just run the trial 3 configuration and do nothing else.

  2. The first three perturbations are the same because trial 0 exploited trial 3, and all other trials exploited trial 0 in the last iteration, copying the full history of trial 0.

  3. pbt_global.txt logs all perturbations. A trial might save several checkpoints before exploiting another trial, so the number of checkpoints is unrelated to the number of perturbations.

I hope this helps!
