How to do checkpoint synchronisation

siva14guru · October 14, 2022, 7:40am

Hey,
After training completes, we run into an issue with the checkpoint manager.
We are using Ray 2.0.
Kindly suggest how to resolve this.

Jiao_Dong · October 17, 2022, 4:14am

Hi @siva14guru, do you have a minimal script to reproduce what you’re seeing, especially what you had for SyncConfig?

siva14guru · October 17, 2022, 5:56am

I didn’t specify a SyncConfig. We are trying to migrate from 1.12.1 to 2.0.0.
Where and how should the SyncConfig be specified?
We are using TorchTrainer

trainer = TorchTrainer(
    train_func,
    train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 4},
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
)
like this
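
On the "where and how" question: in Ray 2.0 a SyncConfig is normally attached through a RunConfig, which the trainer accepts as run_config. The following is only a minimal sketch under that assumption, not a confirmed fix for the checkpoint manager issue. The helper names sync_kwargs and build_trainer are hypothetical, as are the upload_dir value and the 300-second sync_period; the import paths are the ones I would expect for 2.0.0 and are worth checking against the API reference for the installed version.

```python
def sync_kwargs(upload_dir, sync_period=300):
    """Keyword arguments for SyncConfig (hypothetical helper).

    upload_dir is a placeholder URI for remote storage; sync_period is the
    number of seconds between syncs.
    """
    return {"upload_dir": upload_dir, "sync_period": sync_period}


def build_trainer(train_func, num_workers, use_gpu, upload_dir):
    """Build the TorchTrainer from the post with a SyncConfig attached."""
    # Ray imports live inside the function so that sync_kwargs can be
    # inspected on a machine without Ray installed.
    from ray.air.config import RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer
    from ray.tune.syncer import SyncConfig

    return TorchTrainer(
        train_func,
        train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 4},
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        # Assumption: sync settings are passed via RunConfig.sync_config.
        run_config=RunConfig(sync_config=SyncConfig(**sync_kwargs(upload_dir))),
    )
```

With this in place, something like build_trainer(train_func, num_workers, use_gpu, "s3://my-bucket/ray-results") followed by .fit() would be the usage, where the bucket path is again a placeholder.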
