[train] When resuming training, a new `Trial` directory is created, even when resuming from checkpoint

Hello,

I have the following situation:

I have a TorchTrainer with a RunConfig in which I set name=SOME_VALUE and local_dir=SOME_DIR. As a result, my checkpoints are written to SOME_DIR/SOME_VALUE/TorchTrainer_RandomHashEtc_Datetime.

When the training finishes or is cancelled and I later want to resume from a checkpoint, the checkpoint is loaded, but a new directory is created for the checkpoints of the continued run. That is, if I start the run as before but pass it the last checkpoint, it still creates a new directory with a new hash: SOME_DIR/SOME_VALUE/TorchTrainer_RandomHashEtc_Datetime.
If I now have a checkpoint strategy like this:
checkpoint_config = ray_air.CheckpointConfig(num_to_keep=5)
it means that each of the two directories keeps its own 5 checkpoints.

What I would expect is that TorchTrainer recognizes this as the continuation of a single training run, so that it either does not create a new directory and continues in the old one, or at least deletes the oldest checkpoint in the old directory when a new checkpoint is saved.

I understand that Ray Train now uses Ray Tune in the background, and for tuning it makes sense to have one directory per trial, hence the random-hash directories. For Ray Train, however, I don’t think this makes sense, and unfortunately this behavior does not seem to be configurable right now. I traced it to ray/tune/search/basic_variant.py, _TrialIterator->create_trial, which is called at some point during the training setup and sets trial_id = self.uuid_prefix + ("%05d" % self.counter).
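
To illustrate why every run ends up in a differently named directory, here is a minimal sketch of the ID scheme quoted above. Only the expression uuid_prefix + ("%05d" % counter) comes from the traced code; the class name TrialIdSketch and the way the prefix is drawn (first five hex characters of a random UUID) are assumptions for illustration, not Ray's actual implementation.

```python
import uuid


class TrialIdSketch:
    """Hypothetical stand-in for the ID logic traced in basic_variant.py."""

    def __init__(self, uuid_prefix=None):
        # Assumption: a fresh random prefix is drawn per run unless one is given.
        self.uuid_prefix = uuid_prefix or uuid.uuid4().hex[:5] + "_"
        self.counter = 0

    def create_trial_id(self):
        # Same formatting as the expression quoted in the post.
        trial_id = self.uuid_prefix + ("%05d" % self.counter)
        self.counter += 1
        return trial_id


# A fixed prefix yields a reproducible ID, which is what a configurable
# name would make possible.
fixed = TrialIdSketch(uuid_prefix="myrun_")
print(fixed.create_trial_id())  # myrun_00000
```

With a random prefix, the ID (and therefore the directory name) differs between the original and the resumed run, even though the counter starts at 0 both times.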

Could this be made configurable, so that one can simply pass a name? Alternatively, could Ray Train avoid creating random directories, which are not really needed when there is just one training run?

Is there a default way to circumvent this behavior at the moment?

Thank you!

Hey @M_S, this is great feedback! I created 2 issues to track these in ray-project/ray on GitHub: "[Tune] Use same directory when resuming an experiment" (Issue #28829) and "[AIR] [Tune] Don't add random hash to trial id for single trial" (Issue #28830).

How big of a problem is the second issue? Can you just cd into the trial directory and then use it as normal?

Hi @amogkam,

Thanks for putting this into issues.

I can work around this, so it is nothing urgent for me at the moment. I would appreciate it if this got fixed in one of the next releases, though.

Thanks @M_S. For our own prioritization, it would help if you could provide more information for the second issue. Is there a particular workflow that is difficult to do because of the hash that is added?

Hi @amogkam,

In principle the hash is not an issue; the main annoyance was the first issue. So if the first issue is fixed and the Trainer is able to identify the last directory and continue there, I would be happy with that solution.
Alternatively, being able to choose a name instead of the hash could be a way to land in the same directory, because the directory name would then no longer be random. So in a sense issues 1 and 2 are probably related, at least regarding the problems I described in my initial post.
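
As a stopgap for the duplicated retention described in this thread, one option is to prune checkpoints across all trial directories by hand. The sketch below is not part of Ray; the helper name prune_checkpoints and the assumption that checkpoints are subdirectories named checkpoint_* directly inside each trial directory are illustrative and should be checked against the actual layout on disk.

```python
import shutil
from pathlib import Path


def prune_checkpoints(experiment_dir, num_to_keep=5):
    """Keep only the newest `num_to_keep` checkpoints across all trial dirs.

    Assumption: checkpoints are subdirectories named `checkpoint_*` directly
    inside each trial directory under `experiment_dir`.
    """
    checkpoints = [
        path
        for path in Path(experiment_dir).glob("*/checkpoint_*")
        if path.is_dir()
    ]
    # Sort newest first by modification time, breaking ties by path.
    checkpoints.sort(key=lambda p: (p.stat().st_mtime, str(p)), reverse=True)

    removed = []
    for stale in checkpoints[num_to_keep:]:
        shutil.rmtree(stale)  # irreversible, so try it on a copy first
        removed.append(stale)
    return removed
```

It could be called as prune_checkpoints("SOME_DIR/SOME_VALUE", num_to_keep=5) once the resumed run has written its own checkpoints, so that the old and new directories together hold at most 5.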