How severely does this issue affect your experience of using Ray?
Hi, I was following the tutorial for using Ray Tune with PyTorch; it works great, and I was able to adapt it to my own code. However, I was wondering: don't you need to save the checkpoints with unique names? In this configuration it seems that each subsequent trial will just overwrite the pre-existing checkpoints. Thus, if your "best result" is not the last trial, you'll still get the last trial's weights when you load that checkpoint. Or do I misunderstand?
It's the first time I'm using checkpoints/sessions, so I'm not very familiar with the topic. I would appreciate it if someone could clarify or point me to the documentation, because I did not find my answer in the Ray checkpointing docs.
Hi Julia,
Thanks for the question, and glad to hear you're having a good experience with Ray Tune.
To answer the question: no, the checkpoints will not be overwritten. In fact, in your `ray_result` folder you will see a structure like `trial0/checkpoint_00000`, `trial0/checkpoint_00001`, etc. The iteration number of the training loop is reflected in the checkpoint folder name, so each subsequent iteration gets a new suffix. This makes sure that no checkpoints are overwritten!
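The naming scheme above can be illustrated with a small plain-Python sketch (no Ray required). The `checkpoint_dir` helper is purely illustrative, not Ray's API; it just mirrors the `trial0/checkpoint_00000` layout to show why nothing is overwritten:

```python
import os
import tempfile

def checkpoint_dir(trial_dir: str, iteration: int) -> str:
    # Illustrative helper (not Ray's API): build the per-iteration
    # checkpoint path, mirroring the trial0/checkpoint_00000 layout.
    return os.path.join(trial_dir, f"checkpoint_{iteration:05d}")

with tempfile.TemporaryDirectory() as trial_dir:
    # Each training iteration gets its own folder, so no overwriting.
    for i in range(3):
        os.makedirs(checkpoint_dir(trial_dir, i))
    print(sorted(os.listdir(trial_dir)))
    # ['checkpoint_00000', 'checkpoint_00001', 'checkpoint_00002']
```

Because the iteration number is baked into the folder name, loading `checkpoint_00001` always gives you that iteration's state, never a later trial's.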
I will add a section to our documentation to address this point as well.
Thanks for your quick reply! I indeed see what you mean in the `ray_result` folder. So what is actually the difference between these checkpoints and the ones saved with torch at another location (`./my_model` in the tutorial)?
Hi,

Ah, I see. You are probably expecting to see a `my_model` folder under `checkpoint_00000`. If you want to achieve that, you can do `session.report(metrics, checkpoint=Checkpoint.from_directory("."))`, which will give you the exact same directory layout.

The reason it's designed this way is that sometimes one may want to write arbitrary things to the current working directory that they don't want to include in the checkpoint. `my_model` is just the container folder: everything underneath it goes into the final checkpoints you see in `ray_result`. In the example you linked, whether it's called `my_model` or `foo` doesn't matter. If you do `session.report(metrics, checkpoint=Checkpoint.from_directory("foo"))`, Ray Tune will make sure that whatever is under `foo` shows up under `checkpoint_0000x`. For example, `foo/bar` will show up as `checkpoint_0000x/bar`, etc. Hope this helps clarify it a bit.
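To make the `foo/bar` to `checkpoint_0000x/bar` mapping concrete, here is a small plain-Python sketch (no Ray required) that simulates what Tune effectively does when it persists the reported directory. All paths and the `copytree` step are illustrative stand-ins, not Ray internals:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as root:
    # "foo" stands for the directory passed to Checkpoint.from_directory;
    # per the discussion above, its name is arbitrary.
    foo = os.path.join(root, "foo")
    os.makedirs(os.path.join(foo, "bar"))
    with open(os.path.join(foo, "bar", "weights.pt"), "w") as f:
        f.write("fake weights")  # stand-in for a torch.save() artifact

    # What Tune effectively does: persist the directory's *contents*
    # under the trial's checkpoint folder (simulated here with copytree).
    ckpt = os.path.join(root, "checkpoint_00001")
    shutil.copytree(foo, ckpt)

    print(sorted(os.listdir(ckpt)))               # ['bar']
    print(os.listdir(os.path.join(ckpt, "bar")))  # ['weights.pt']
```

Note that only the contents under `foo` appear in the checkpoint; the `foo` name itself is dropped, which is why the tutorial's `my_model` folder does not show up under `checkpoint_00000`.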