How can I save a model using ray train with distributed training?
If I do torch.save() the model is saved on one of my worker nodes instead of the head node.
Hey @Dylan_Phoon, this can be done with checkpointing: save the model's state_dict inside your training loop and report it as a Ray Train checkpoint. Ray Train then persists the checkpoint to the run's storage path, so it is retrievable from the driver rather than stranded on whichever worker happened to write it.
This should be made clearer in our documentation, sorry for the confusion!