Hello,

I have the following situation: I have a TorchTrainer with a RunConfig where I have set name=SOME_VALUE and local_dir=SOME_DIR. This means my checkpoints are written to SOME_DIR/SOME_VALUE/TorchTrainer_RandomHashEtc_Datetime.
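For reference, here is a minimal sketch of the setup (the training loop, worker count, and epoch numbers are just illustrative placeholders):

```python
# Minimal sketch of the setup described above; training loop and numbers are placeholders.
from ray.air import Checkpoint, RunConfig, ScalingConfig, session
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    for epoch in range(config["num_epochs"]):
        # ... real training step would go here ...
        session.report(
            {"epoch": epoch},
            checkpoint=Checkpoint.from_dict({"epoch": epoch}),
        )


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 10},
    scaling_config=ScalingConfig(num_workers=2),
    run_config=RunConfig(name="SOME_VALUE", local_dir="SOME_DIR"),
)
result = trainer.fit()
# Checkpoints land in SOME_DIR/SOME_VALUE/TorchTrainer_<random hash>_<datetime>/
```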
Now, when the training finishes or is cancelled and I want to resume from a checkpoint later, the checkpoint is loaded, but a new directory is created for the checkpoints of the continued run. That is, if I start the run exactly as before but pass it the last checkpoint, it still creates a new directory with a new hash under SOME_DIR/SOME_VALUE/TorchTrainer_RandomHashEtc_Datetime.
If I now use a checkpoint strategy like
checkpoint_config = ray_air.CheckpointConfig(num_to_keep=5)
this means that both directories will keep 5 checkpoints.
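Roughly, the resumed run looks like the sketch below, assuming the last checkpoint is passed in via the trainer's resume_from_checkpoint argument; the checkpoint subdirectory name is only illustrative:

```python
# Sketch of the resumed run: same name/local_dir, last checkpoint passed in explicitly.
# The checkpoint_000004 subdirectory name below is only illustrative.
from ray.air import Checkpoint, CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

last_ckpt = Checkpoint.from_directory(
    "SOME_DIR/SOME_VALUE/TorchTrainer_RandomHashEtc_Datetime/checkpoint_000004"
)

trainer = TorchTrainer(
    train_loop_per_worker,  # same training loop as in the first run
    train_loop_config={"num_epochs": 10},
    scaling_config=ScalingConfig(num_workers=2),
    run_config=RunConfig(
        name="SOME_VALUE",
        local_dir="SOME_DIR",
        checkpoint_config=CheckpointConfig(num_to_keep=5),
    ),
    resume_from_checkpoint=last_ckpt,
)
trainer.fit()
# Result: a second directory SOME_DIR/SOME_VALUE/TorchTrainer_<new hash>_<new datetime>/
# is created, and num_to_keep=5 only applies within it, so the old directory also
# keeps its 5 checkpoints.
```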
What I would expect to happen is that TorchTrainer recognizes that this is the continuation of a single training run, and either does not create a new directory but continues in the old one, or at least deletes the oldest checkpoint in the old directory whenever a new checkpoint is saved.
I understand that ray-train now uses ray-tune in the background, and for tuning it makes sense to have one directory per trial, so the random-hash directories make sense there. For ray-train, however, I don't think this makes sense, and unfortunately this behavior does not seem to be configurable at the moment (I traced it to ray/tune/search/basic_variant.py, _TrialIterator->create_trial, which is called at some point during training setup and sets trial_id = self.uuid_prefix + ("%05d" % self.counter)).
Could this be made configurable, so that one can simply pass a name? Or alternatively, could ray-train skip creating these random directories, since they are not strictly required for a single training run?
Is there a default way to circumvent this behavior at the moment?
Thank you!