How to start fault tolerance

daxixi · December 8, 2021, 7:01am

Hey guys! Do you know how to enable fault tolerance in Ray Train? The documentation suggests that I should implement checkpoint loading and saving. Is there anything else I need to set if I want to use the fault tolerance feature of Ray Train?

This is the part of my train_func

checkpoint = sgd.load_checkpoint()
if not checkpoint == None:
    model = checkpoint.get("model", 0)
    start_epoch = checkpoint.get("epoch", -1) + 1
else:
    model = create_model('resnet-18', 10)
    start_epoch = 0

    for epoch in range(start_epoch,epochs):
        forward,backward,step,timedur=train_epoch(train_dataloader, model, loss_fn, optimizer, device,epoch)
        sgd.save_checkpoint(epoch=epoch, model=model)
        sgd.report(forward=forward, backward=backward,step=step,time=timedur)

Are these settings enough, or do I need to set other parameters or add other code?

amogkam · December 11, 2021, 2:11am

Hey @daxixi! The way you are using checkpointing looks correct to me! The only thing is that you probably want to move the for loop outside the if/else block. As posted, the loop is indented under the else branch, so no training will actually happen if you are recovering from a checkpoint.
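
For anyone reading later, here is a rough sketch of the control flow with the loop moved out of the if/else. It does not call Ray: `load_checkpoint` and `save_checkpoint` below are hypothetical in-memory stand-ins for the `sgd.load_checkpoint` / `sgd.save_checkpoint` calls in the question, the dict-based model replaces `create_model(...)` and `train_epoch(...)`, and `fail_at` and `run_with_one_failure` exist only to simulate a crash and a restart. Treat it as an illustration of the resume logic, not as the Ray Train implementation.

```python
# Hypothetical in-memory stand-ins for the checkpoint calls used in the
# question, so the resume logic can be run without a Ray cluster.
_latest_checkpoint = None


def load_checkpoint():
    """Return the most recently saved checkpoint dict, or None."""
    return _latest_checkpoint


def save_checkpoint(**kwargs):
    """Store the keyword arguments as the latest checkpoint."""
    global _latest_checkpoint
    _latest_checkpoint = dict(kwargs)


def train_func(epochs, fail_at=None):
    """Resume from a checkpoint if one exists, then train the remaining epochs."""
    checkpoint = load_checkpoint()
    if checkpoint is not None:
        model = checkpoint["model"]
        start_epoch = checkpoint["epoch"] + 1
    else:
        model = {"steps": 0}  # placeholder for create_model(...)
        start_epoch = 0

    # The loop sits outside the if/else, so it runs on both paths.
    trained = []
    for epoch in range(start_epoch, epochs):
        if fail_at is not None and epoch == fail_at:
            raise RuntimeError("simulated worker failure")
        model["steps"] += 1  # placeholder for train_epoch(...)
        save_checkpoint(epoch=epoch, model=model)
        trained.append(epoch)
    return trained


def run_with_one_failure(epochs=5, fail_at=3):
    """Simulate a crash at epoch `fail_at`, then a restart that resumes."""
    global _latest_checkpoint
    _latest_checkpoint = None
    try:
        train_func(epochs, fail_at=fail_at)
    except RuntimeError:
        pass  # a real setup would restart the worker here
    return train_func(epochs)
```

Calling `run_with_one_failure()` trains epochs 0 to 2, hits the simulated failure at epoch 3, and the second call to `train_func` picks up from the saved checkpoint and trains only epochs 3 and 4. With the loop left under the else branch, that second call would have trained nothing.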

daxixi · December 11, 2021, 2:32am

Thanks very much for your reply and reminding me of that!
