How can I save a model using ray train with distributed training?
If I do torch.save() the model is saved on one of my worker nodes instead of the head node.
Hey @Dylan_Phoon, this can be done with checkpointing: save the model's state_dict inside your training loop and report it as a Ray Train checkpoint. Ray Train then persists the checkpoint to the run's storage path, so it is retrievable from the driver rather than stranded on whichever worker happened to write it.
This should be made clearer in our documentation, sorry for the confusion!