How severely does this issue affect your experience of using Ray?
Hi, I was following the tutorial for using Ray Tune with PyTorch; it works great, and I was able to adapt it to my own code. However, I was wondering: don't you need to save the checkpoints with unique names? In this configuration it seems that each subsequent trial will just overwrite the pre-existing checkpoints. Thus, if your "best result" is not the last trial, you'll still get the last trial's weights when you load that checkpoint. Or do I misunderstand?
It's the first time I'm using checkpoints/sessions, so I'm not very familiar with the topic. I would appreciate it if someone could clarify or point me to the documentation, because I did not find my answer in the Ray checkpointing docs.
Hi Julia,
Thanks for the question, and glad to hear you're having a good experience with Ray Tune.
To answer the question: no, the checkpoints will not be overwritten. In fact, in your `ray_result` folder you will see a structure like `trial0/checkpoint_00000`, `trial0/checkpoint_00001`, etc. The iteration number of the training loop is reflected in the checkpoint folder name, so each subsequent iteration gets a new suffix. This makes sure that no checkpoints are overwritten!
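The naming scheme above can be illustrated with a small plain-Python sketch (no Ray required). The `checkpoint_dir` helper is purely illustrative, not Ray's API; it just mirrors the `trial0/checkpoint_00000` layout to show why nothing is overwritten:

```python
import os
import tempfile

def checkpoint_dir(trial_dir: str, iteration: int) -> str:
    # Illustrative helper (not Ray's API): build the per-iteration
    # checkpoint path, mirroring the trial0/checkpoint_00000 layout.
    return os.path.join(trial_dir, f"checkpoint_{iteration:05d}")

with tempfile.TemporaryDirectory() as trial_dir:
    # Each training iteration gets its own folder, so no overwriting.
    for i in range(3):
        os.makedirs(checkpoint_dir(trial_dir, i))
    print(sorted(os.listdir(trial_dir)))
    # ['checkpoint_00000', 'checkpoint_00001', 'checkpoint_00002']
```

Because the iteration number is baked into the folder name, loading `checkpoint_00001` always gives you that iteration's state, never a later trial's.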
I will add a section to our documentation to address this point as well.
Thanks for your quick reply! I indeed see what you mean in the `ray_result` folder. So what is actually the difference between these checkpoints and the ones saved with torch at another location (`./my_model` in the tutorial)?
Hi,

Ah, I see. You are probably expecting to see a `my_model` folder under `checkpoint_00000`. If you want to achieve that, you can do `session.report(metrics, checkpoint=Checkpoint.from_directory("."))`, which will give you the exact same directory layout.

The reason it's designed this way is that sometimes one may want to write arbitrary things to the current working directory that they don't want to include in the checkpoint. `my_model` is just the container folder: everything underneath it goes into the final checkpoints you see in `ray_result`. In the example you linked, whether it's called `my_model` or `foo` doesn't matter. If you do `session.report(metrics, checkpoint=Checkpoint.from_directory("foo"))`, Ray Tune will make sure that whatever is under `foo` shows up under `checkpoint_0000x`. For example, `foo/bar` will show up as `checkpoint_0000x/bar`, etc. Hope this helps clarify it a bit.
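To make the `foo/bar` to `checkpoint_0000x/bar` mapping concrete, here is a small plain-Python sketch (no Ray required) that simulates what Tune effectively does when it persists the reported directory. All paths and the `copytree` step are illustrative stand-ins, not Ray internals:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as root:
    # "foo" stands for the directory passed to Checkpoint.from_directory;
    # per the discussion above, its name is arbitrary.
    foo = os.path.join(root, "foo")
    os.makedirs(os.path.join(foo, "bar"))
    with open(os.path.join(foo, "bar", "weights.pt"), "w") as f:
        f.write("fake weights")  # stand-in for a torch.save() artifact

    # What Tune effectively does: persist the directory's *contents*
    # under the trial's checkpoint folder (simulated here with copytree).
    ckpt = os.path.join(root, "checkpoint_00001")
    shutil.copytree(foo, ckpt)

    print(sorted(os.listdir(ckpt)))               # ['bar']
    print(os.listdir(os.path.join(ckpt, "bar")))  # ['weights.pt']
```

Note that only the contents under `foo` appear in the checkpoint; the `foo` name itself is dropped, which is why the tutorial's `my_model` folder does not show up under `checkpoint_00000`.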