ValueError: Could not recover from checkpoint

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi everyone,
I’m having some issues with checkpointing during an RLlib Tune experiment that uses the PBT scheduler. The specific error is: “ValueError: Could not recover from checkpoint as it does not exist on storage anymore. Got storage fs type local and path: …”. For the checkpoint configuration I’m using train.CheckpointConfig. It looks like a checkpoint gets deleted, and then, during PBT’s exploitation step, a trial tries to restore from it and it no longer exists.
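For context, here is a minimal sketch of the kind of setup where this happens for me (the algorithm, environment, and hyperparameter values are illustrative placeholders, not my exact config):

from ray import train, tune
from ray.tune.schedulers import PopulationBasedTraining

# Illustrative setup: time-based perturbation combined with a low num_to_keep,
# which is where the missing-checkpoint error eventually shows up.
pbt = PopulationBasedTraining(
    time_attr="time_total_s",
    perturbation_interval=120,  # perturb every 2 minutes of training time
    hyperparam_mutations={"lr": tune.loguniform(1e-5, 1e-2)},
)
tuner = tune.Tuner(
    "PPO",  # any RLlib algorithm registered with Tune
    param_space={"env": "CartPole-v1", "lr": 1e-3},
    tune_config=tune.TuneConfig(
        scheduler=pbt,
        num_samples=4,
        metric="episode_reward_mean",
        mode="max",
    ),
    run_config=train.RunConfig(
        checkpoint_config=train.CheckpointConfig(
            num_to_keep=2,           # old checkpoints get deleted...
            checkpoint_frequency=1,  # ...while PBT may still point a trial at one of them
        ),
    ),
)
results = tuner.fit()  # the ValueError surfaces during a later exploit/restore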

Are you aware of any known issues related to this one (Ray 2.10)?

GitHub issue: [RLlib|Tune|Train] ValueError: Could not recover from checkpoint as it does not exist anymore · Issue #45176 · ray-project/ray · GitHub

If anyone runs into the same issue, here is @justinvyu’s answer on GitHub:

PBT with (very frequent) time-based checkpointing combined with a low num_to_keep is not very stable, because trial scheduling is nondeterministic. Here are a few tips to get this working:

  • Use training_iteration as the perturbation interval unit instead of time_total_s:

from ray import train
from ray.tune import Tuner
from ray.tune.schedulers import PopulationBasedTraining

checkpoint_frequency = 2

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=checkpoint_frequency,
    ...,
)
tuner = Tuner(
    ...,
    run_config=train.RunConfig(
        checkpoint_config=train.CheckpointConfig(
            # num_to_keep=4,  # if disk space is not a big issue, leave this unset to keep all checkpoints; otherwise, set it high enough.
            checkpoint_frequency=checkpoint_frequency,
        ),
    ),
)
  • Another option is to set synch=True so that all trials stay in lockstep, which means the checkpoint assigned to a trial can never be missing. You should be able to use a lower num_to_keep in this scenario (see the sketch after this list).
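A minimal sketch of the synch=True option, assuming the same iteration-based perturbation interval as above (the mutation space is just a placeholder):

from ray.tune.schedulers import PopulationBasedTraining

# Synchronous PBT keeps all trials in lockstep, so a trial is never told to
# exploit a checkpoint that has already been cleaned up.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=2,   # keep this aligned with checkpoint_frequency
    synch=True,                # wait for every trial before exploit/explore
    hyperparam_mutations={"lr": [1e-3, 1e-4, 1e-5]},  # placeholder mutation space
)
# Pass pbt as the scheduler in tune.TuneConfig exactly as before; a lower
# num_to_keep in train.CheckpointConfig should then be safe.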

In previous Ray versions, checkpoints were only created every checkpoint_frequency iterations; now a checkpoint is created on every iteration.

This may be a combination of a change in checkpoint folder naming and the time-based perturbation interval you currently have:

  • Checkpoint folders are now named by checkpoint index rather than by training_iteration; the index starts at 0 and increments by 1 for each checkpoint (see the listing sketch after this list).
  • A checkpoint is forced on every perturbation interval for high-performing trials, which can make checkpointing more frequent.
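If it helps to see the new naming in practice, you can list a trial directory (the path below is a hypothetical placeholder, not a real experiment):

import os

# Hypothetical trial directory; replace with your actual storage path.
trial_dir = os.path.expanduser("~/ray_results/my_pbt_experiment/trial_00000")

# Folders are indexed checkpoint_000000, checkpoint_000001, ... regardless of
# the training_iteration at which each checkpoint was taken.
checkpoint_dirs = sorted(
    d for d in os.listdir(trial_dir) if d.startswith("checkpoint_")
)
print(checkpoint_dirs)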