Hi, I was following the tutorial for using Ray Tune with PyTorch. It works great and I was able to adapt it to my own code. However, I was wondering whether you need to save the checkpoints with unique names. In this config it seems that each subsequent trial will just overwrite the pre-existing checkpoints. So if your "best result" is not the last trial, you'll get the last trial anyway when you load that checkpoint. Or do I misunderstand?
It's my first time using checkpoints/sessions, so I'm not very familiar with the topic. I would appreciate it if someone could clarify or point me to the documentation, because I did not find my answer in the Ray checkpointing docs.
Thanks for the question, and glad to hear you've had a good experience with Ray Tune.
To answer the question: no, the checkpoints will not be overwritten. In your `ray_result` folder you will see a structure like `trial0/checkpoint_00001`, and so on. The iteration number of the training loop is reflected in the checkpoint folder name, so each new iteration gets a new suffix and no checkpoint is overwritten.
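If you want to check this on your own run, a minimal sketch like the one below lists the checkpoint folders of one trial. It only uses the standard library and is not part of the tutorial; the helper name `list_checkpoints` is made up for illustration, and the exact folder names and zero-padding may differ between Ray versions.

```python
import os
import tempfile


def list_checkpoints(trial_dir):
    """Return the checkpoint sub-folder names of one trial, in sorted order."""
    return sorted(
        name
        for name in os.listdir(trial_dir)
        if name.startswith("checkpoint_")
        and os.path.isdir(os.path.join(trial_dir, name))
    )


if __name__ == "__main__":
    # Build a fake trial folder just to show the layout;
    # in a real run these folders are created by Ray Tune.
    with tempfile.TemporaryDirectory() as trial_dir:
        for i in range(3):
            os.makedirs(os.path.join(trial_dir, f"checkpoint_{i:05d}"))
        print(list_checkpoints(trial_dir))
```

Pointing `list_checkpoints` at one of the trial folders inside your results directory should show one entry per reported checkpoint.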
I will also add a section to our documentation to address this confusion.
Thanks for your quick reply! I do see what you mean in the `ray_result` folder. So what is actually the difference between these checkpoints and the ones saved with torch at another location (`./my_model` in the tutorial)?
Ah, I see. So you were probably expecting to see a "my_model" folder under "checkpoint_00000".
If you want that, you can call `session.report(metrics, checkpoint=Checkpoint.from_directory("."))`, which will give you the exact same directory layout inside the checkpoint.
The reason it's designed this way is that sometimes you may want to write arbitrary things to the current working directory that you don't want to include in the checkpoint.
`my_model` is just the container folder: everything underneath it goes into the final checkpoints you see in `ray_result`. In the example you linked, the name itself doesn't matter; it could just as well be called `foo`. If you call `session.report(metrics, checkpoint=Checkpoint.from_directory("foo"))`, Ray Tune will make sure that whatever is under `foo` shows up under `checkpoint_0000x`. For example, `foo/bar` will show up as `checkpoint_0000x/bar`, and so on. Hope this helps clarify it a bit.
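To make the mapping concrete, here is a rough sketch of a training function that writes a file into `foo` and reports that folder as the checkpoint. The names `save_state`, `train_fn`, `bar.json` and `num_steps` are made up for illustration, the loop is a placeholder rather than real training, and the import paths are an assumption that depends on your Ray version.

```python
import json
import os


def save_state(checkpoint_dir, state):
    """Write `state` to <checkpoint_dir>/bar.json and return the file path."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, "bar.json")
    with open(path, "w") as f:
        json.dump(state, f)
    return path


def train_fn(config):
    # Ray is imported here so that save_state() can be tried without Ray installed.
    # Assumption: these import paths match the Ray 2.x AIR API used in this thread.
    from ray.air import session
    from ray.air.checkpoint import Checkpoint

    for step in range(config.get("num_steps", 3)):
        loss = 1.0 / (step + 1)  # placeholder metric, not a real training loop
        save_state("foo", {"step": step, "loss": loss})
        # Whatever is under foo/ shows up in this iteration's checkpoint folder,
        # so foo/bar.json becomes checkpoint_0000x/bar.json.
        session.report({"loss": loss}, checkpoint=Checkpoint.from_directory("foo"))
```

You would run `train_fn` the same way as the training function in the tutorial. Renaming `foo` to anything else should not change what ends up inside the checkpoint folders.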