How to get the trial directory or trial ID of the experiment used for initializing the weights of a mutation generated by PB2 or PopulationBasedTraining

mnazemi · February 13, 2023, 5:40am

My understanding of PB2 and PopulationBasedTraining is that they use checkpoints from other trials when initializing a model created for a new mutation.

As shown in this guide, one has to use an if session.get_checkpoint(): conditional statement to see if the new trial will use a checkpoint from an earlier trial.

I was wondering if there is a way to get the trial_dir or trial_id of the trial from which the checkpoint is loaded. My training function supports storing and loading checkpoints, and I would like to avoid the overhead of storing the same checkpoints with session.report(). My training function would then look like this:

import os

if session.get_checkpoint():  # or some other flag
    # Assuming session.get_source_trial_dir()
    # or something similar exists
    source_trial_dir = session.get_source_trial_dir()

    # By setting args.checkpoint_path,
    # the training function will load the weights
    # (and optionally, optimizer parameters)
    # from the specified checkpoint.
    args.checkpoint_path = os.path.join(source_trial_dir, "checkpoint.pth.tar")

mnazemi · February 13, 2023, 5:53am

A hacky solution would look like this:

import os

from ray.air import Checkpoint, session

# When creating a checkpoint, only store its path
if self.args.tune != "":
    checkpoint_path = os.path.join(
        session.get_trial_dir(), self.logdir, "checkpoint.pth.tar"
    )
    checkpoint = Checkpoint.from_dict({"checkpoint_path": checkpoint_path})

    session.report(
        {"loss": loss, "accuracy": top1}, checkpoint=checkpoint
    )

# When restoring, read the stored path back from the checkpoint
checkpoint = session.get_checkpoint()
if checkpoint:
    args.checkpoint_path = checkpoint.to_dict()["checkpoint_path"]
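
The path-only round trip above can be tried out without Ray. What follows is a minimal sketch using only the standard library: `save_path_checkpoint` and `load_from_path_checkpoint` are hypothetical helpers, the plain dict stands in for the payload passed to `Checkpoint.from_dict`, and `pickle` stands in for `torch.save`/`torch.load`. It assumes the checkpoint file is still present and readable from wherever the restoring trial runs, which may not hold if trials are scheduled on different nodes without a shared filesystem.

```python
import os
import pickle
import tempfile


def save_path_checkpoint(trial_dir, logdir, state):
    """Write the real checkpoint to disk and return a dict holding only its path."""
    ckpt_dir = os.path.join(trial_dir, logdir)
    os.makedirs(ckpt_dir, exist_ok=True)
    checkpoint_path = os.path.join(ckpt_dir, "checkpoint.pth.tar")
    with open(checkpoint_path, "wb") as f:
        pickle.dump(state, f)  # stand-in for torch.save(state, checkpoint_path)
    return {"checkpoint_path": checkpoint_path}


def load_from_path_checkpoint(checkpoint_dict):
    """Follow the stored path and load the real checkpoint."""
    with open(checkpoint_dict["checkpoint_path"], "rb") as f:
        return pickle.load(f)  # stand-in for torch.load(checkpoint_path)


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: <root>/trial_A/logs/checkpoint.pth.tar
    trial_a_dir = os.path.join(root, "trial_A")
    pointer = save_path_checkpoint(
        trial_a_dir, "logs", {"epoch": 3, "weights": [0.1, 0.2]}
    )

    # A later trial receives only the pointer and follows it.
    restored = load_from_path_checkpoint(pointer)

    # The source trial directory can be recovered from the stored path,
    # assuming logdir is a single directory level below the trial directory.
    source_trial_dir = os.path.dirname(os.path.dirname(pointer["checkpoint_path"]))
    source_trial_name = os.path.basename(source_trial_dir)
```

Because the stored path embeds the trial directory, this also gives a way to find out which trial the weights came from, at the cost of depending on that directory layout.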

xwjiang2010 · February 13, 2023, 4:57pm

Why is that needed?
Let’s say that for trialB, at iteration n, you load trialA’s checkpoint, apply a mutation, and start training. When you reach the next checkpoint stage (iteration m) for trialB, you still want to save a checkpoint for it, for two reasons:

1. trialB is not exactly the same as trialA, since it has been mutated.
2. trialB has already made non-trivial progress (from iteration n to iteration m).
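
The two points above can be illustrated with a toy simulation that does not use Ray. This is only a sketch under made-up assumptions: a single scalar weight, a quadratic loss, a hypothetical `train` helper, and arbitrary learning rates and iteration counts. trialB starts from trialA's checkpoint at iteration n with a mutated learning rate, and by iteration m its state no longer matches the checkpoint it was cloned from.

```python
def train(weight, lr, steps):
    """Run plain gradient descent on loss = weight**2 for `steps` iterations."""
    for _ in range(steps):
        grad = 2.0 * weight
        weight -= lr * grad
    return weight


n, m = 5, 10  # arbitrary iteration counts for the illustration

# trialA trains to iteration n and saves a checkpoint.
trial_a_ckpt = {"iteration": n, "weight": train(1.0, lr=0.1, steps=n), "lr": 0.1}

# trialB is initialized from trialA's checkpoint with a mutated hyperparameter.
trial_b = dict(trial_a_ckpt, lr=trial_a_ckpt["lr"] * 1.2)

# trialB then trains from iteration n to iteration m.
trial_b["weight"] = train(trial_b["weight"], trial_b["lr"], steps=m - n)
trial_b["iteration"] = m

# trialB's state now differs from the checkpoint it started from,
# so pointing back at trialA's file would lose trialB's own progress.
state_diverged = trial_b["weight"] != trial_a_ckpt["weight"]
print(state_diverged)
```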
