I already posted a similar issue here and found other people reporting the same problem, but I'm still not able to solve it. In my training loop I save checkpoints like this:
import os
import torch
from ray.air import session
from ray.air.checkpoint import Checkpoint

trial_dir = session.get_trial_dir()
checkpoint = Checkpoint.from_directory(trial_dir)
# save each (model, optimizer state) pair into the trial directory
for idx, (model, opt) in enumerate(zip(model_type, optimizer)):
    torch.save((model, opt.state_dict()), os.path.join(trial_dir, f"checkpoint{idx}"))
# write the metrics as a dict checkpoint into the same directory
chkpt = Checkpoint.from_dict({"loss": val_loss, "running_loss": running_loss, "training_iteration": epoch})
with checkpoint.as_directory() as chkpt_dir:
    chkpt.to_directory(chkpt_dir)
session.report({"loss": val_loss, "running_loss": running_loss}, checkpoint=checkpoint)
My model training is taking a huge amount of time because of this checkpointing step. How can I overcome this? Is this a known Ray problem? Please let me know, I'm stuck.