[train] When resuming training, a new `Trial` directory is created, even when resuming from checkpoint

M_S · September 26, 2022, 8:27am

Hello,

I have the following situation:

I have a `TorchTrainer` with a `RunConfig` in which I set `name=SOME_VALUE` and `local_dir=SOME_DIR`. This means my checkpoints are written to `SOME_DIR/SOME_VALUE/TorchTrainer_RandomHashEtc_Datetime`.

Now, when the training finishes or is cancelled and I later want to resume from a checkpoint, the checkpoint is loaded, but a new directory is created for the checkpoints of the continued run. That is, if I start the run as before but give it the last checkpoint, it still creates a new directory with a new hash in `SOME_DIR/SOME_VALUE/TorchTrainer_RandomHashEtc_Datetime`.
If I now have a checkpoint strategy like this:
checkpoint_config = ray_air.CheckpointConfig(num_to_keep=5)
it means that each of the two directories keeps its own 5 checkpoints, so up to 10 accumulate.

What I would expect is that `TorchTrainer` recognizes this as the continuation of a single training run, so that it either does not create a new directory and continues in the old one, or at least deletes the oldest checkpoint in the old directory when a new checkpoint is saved.
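As a stopgap for the second option, old checkpoints can be pruned by hand across all trial directories of one experiment. This is only a minimal sketch using the standard library, not a Ray API: the function name `prune_checkpoints` is hypothetical, and it assumes a layout of `SOME_DIR/SOME_VALUE/<trial_dir>/checkpoint_*` and that modification time reflects checkpoint age.

```python
import shutil
from pathlib import Path


def prune_checkpoints(experiment_dir, num_to_keep):
    """Keep only the newest `num_to_keep` checkpoints across all trial dirs.

    Assumption (not taken from Ray): checkpoints live in
    <experiment_dir>/<trial_dir>/checkpoint_* and mtime reflects their age.
    Returns the list of removed checkpoint paths.
    """
    checkpoints = [
        p for p in Path(experiment_dir).glob("*/checkpoint_*") if p.is_dir()
    ]
    # Sort newest first, judged by modification time.
    checkpoints.sort(key=lambda p: p.stat().st_mtime, reverse=True)

    removed = []
    for old in checkpoints[num_to_keep:]:
        shutil.rmtree(old)
        removed.append(old)
    return removed
```

Calling `prune_checkpoints("SOME_DIR/SOME_VALUE", 5)` after each resumed run would then leave 5 checkpoints in total rather than 5 per directory.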

I understand that Ray Train now uses Ray Tune in the background, and for tuning it makes sense to have one directory per trial, hence the random-hash directories. For Ray Train, however, I don't think this makes sense, and right now the behavior does not seem to be configurable. I traced it to `ray/tune/search/basic_variant.py`, `_TrialIterator` -> `create_trial`, which is called at some point during the training setup and sets `trial_id = self.uuid_prefix + ("%05d" % self.counter)`.
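To make the quoted `trial_id` line concrete, here is a toy stand-in for that ID scheme. It is not Ray's actual class: the name `TrialIdGenerator`, the way the prefix is drawn, and the point at which the counter is incremented are all assumptions for illustration. Only the string formatting is taken from the line above.

```python
import uuid


class TrialIdGenerator:
    """Toy model of the quoted ID logic; not Ray's implementation."""

    def __init__(self, uuid_prefix=None):
        # Assumption: a short random prefix is drawn once per run
        # unless the caller supplies a fixed one.
        if uuid_prefix is None:
            uuid_prefix = uuid.uuid4().hex[:5] + "_"
        self.uuid_prefix = uuid_prefix
        self.counter = 0

    def next_trial_id(self):
        # Same formatting as the line quoted from basic_variant.py.
        trial_id = self.uuid_prefix + ("%05d" % self.counter)
        self.counter += 1
        return trial_id
```

With a random prefix, every fresh run gets a different ID and therefore a different directory name. With a caller-supplied prefix such as `TrialIdGenerator("abcde_")`, the first ID is always `abcde_00000`, which is the kind of configurability asked for here.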

Could this be made configurable, so that one can just pass a name? Alternatively, could Ray Train skip creating the random directories, which are not strictly required when there is just one training run?

Is there a default way to circumvent this behavior at the moment?

Thank you!

amogkam · September 27, 2022, 9:47pm

Hey @M_S, this is great feedback! I created 2 issues to track these: "[Tune] Use same directory when resuming an experiment" (Issue #28829, ray-project/ray) and "[AIR] [Tune] Don't add random hash to trial id for single trial" (Issue #28830, ray-project/ray).

For the second issue, how big of an issue is this? Can you just cd into the trial directory and then use it as normal?

M_S · September 28, 2022, 6:54am

Hi @amogkam,

thanks for putting this into issues.

I can work around this, so it is nothing urgent for me at the moment. I would appreciate it if this were fixed in one of the next releases, though.

amogkam · September 28, 2022, 3:13pm

Thanks @M_S. For our own prioritization, it would help if you could provide more information for the second issue. Is there a particular workflow that is difficult to do because of the hash that is added?

M_S · September 29, 2022, 7:24am

Hi @amogkam,

In principle the hash is not an issue; the main annoyance was the first issue. If the first issue is fixed and the Trainer can identify the last directory and continue there, I would be happy with that solution.
Alternatively, being able to choose a name instead of the hash could be a way to land in the same directory, because the directory name would no longer be random. In that sense issues 1 and 2 are probably related, at least for the problems I described in my initial post.
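Until the Trainer can identify the last directory itself, locating the checkpoint to resume from can be scripted. The following is a minimal sketch with the standard library only; `find_latest_checkpoint` is a hypothetical helper, and the `TorchTrainer_*/checkpoint_*` layout and the use of modification time as "latest" are assumptions, not documented Ray behavior.

```python
from pathlib import Path


def find_latest_checkpoint(experiment_dir, trial_prefix="TorchTrainer_"):
    """Return the most recently modified checkpoint directory, or None.

    Assumed layout: <experiment_dir>/<trial_prefix>*/checkpoint_*.
    Searches every matching trial directory, since the newest trial
    directory may not contain a checkpoint yet.
    """
    candidates = [
        p
        for p in Path(experiment_dir).glob(trial_prefix + "*/checkpoint_*")
        if p.is_dir()
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The returned path's parent is the trial directory that last wrote a checkpoint, so the same call also answers which directory a resumed run would ideally continue in.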
