ValueError: Could not recover from checkpoint

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi everyone,
I’m having some issues with checkpointing during an RLlib Tune experiment that uses the PBT scheduler. The specific error is: “ValueError: Could not recover from checkpoint as it does not exist on storage anymore. Got storage fs type local and path: …”. For the checkpoint configuration I’m using train.CheckpointConfig. It looks like a checkpoint gets deleted, and then, during PBT’s exploitation step, a trial tries to restore from it and it no longer exists.
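For context, here is a minimal sketch of the kind of setup where this happens for me (the algorithm, environment, and hyperparameter values are illustrative placeholders, not my exact config):

from ray import train, tune
from ray.tune.schedulers import PopulationBasedTraining

# Illustrative setup: time-based perturbation combined with a low num_to_keep,
# which is where the missing-checkpoint error eventually shows up.
pbt = PopulationBasedTraining(
    time_attr="time_total_s",
    perturbation_interval=120,  # perturb every 2 minutes of training time
    hyperparam_mutations={"lr": tune.loguniform(1e-5, 1e-2)},
)
tuner = tune.Tuner(
    "PPO",  # any RLlib algorithm registered with Tune
    param_space={"env": "CartPole-v1", "lr": 1e-3},
    tune_config=tune.TuneConfig(
        scheduler=pbt,
        num_samples=4,
        metric="episode_reward_mean",
        mode="max",
    ),
    run_config=train.RunConfig(
        checkpoint_config=train.CheckpointConfig(
            num_to_keep=2,           # old checkpoints get deleted...
            checkpoint_frequency=1,  # ...while PBT may still point a trial at one of them
        ),
    ),
)
results = tuner.fit()  # the ValueError surfaces during a later exploit/restore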

Are you aware of any known issues related to this one (Ray 2.10)?

GitHub issue: [RLlib|Tune|Train] ValueError: Could not recover from checkpoint as it does not exist anymore · Issue #45176 · ray-project/ray · GitHub

If anyone runs into the same issue, here is @justinvyu’s answer on GitHub:

PBT with (very frequent) time-based checkpointing combined with a low num_to_keep is not very stable, because trial scheduling is nondeterministic. Here are a few tips to get this working:

  • Use training_iteration as the perturbation interval unit instead of time_total_s:

from ray import train
from ray.tune import Tuner
from ray.tune.schedulers import PopulationBasedTraining

checkpoint_frequency = 2

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=checkpoint_frequency,
    ...,
)
tuner = Tuner(
    ...,
    run_config=train.RunConfig(
        checkpoint_config=train.CheckpointConfig(
            # num_to_keep=4,  # if disk space is not a big issue, leave this unset to keep all checkpoints; otherwise, set it high enough.
            checkpoint_frequency=checkpoint_frequency,
        ),
    ),
)
  • Another option is to set synch=True so that all trials stay in lockstep, which means the checkpoint assigned to a trial can never be missing. You should be able to use a lower num_to_keep in this scenario (see the sketch after this list).
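A minimal sketch of the synch=True option, assuming the same iteration-based perturbation interval as above (the mutation space is just a placeholder):

from ray.tune.schedulers import PopulationBasedTraining

# Synchronous PBT keeps all trials in lockstep, so a trial is never told to
# exploit a checkpoint that has already been cleaned up.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=2,   # keep this aligned with checkpoint_frequency
    synch=True,                # wait for every trial before exploit/explore
    hyperparam_mutations={"lr": [1e-3, 1e-4, 1e-5]},  # placeholder mutation space
)
# Pass pbt as the scheduler in tune.TuneConfig exactly as before; a lower
# num_to_keep in train.CheckpointConfig should then be safe.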

In previous Ray versions, checkpoints were only created every checkpoint_frequency iterations; now a checkpoint is created on every iteration.

This may be a combination of a change in checkpoint folder naming and the time-based perturbation interval you currently have:

  • Checkpoint folders are now named by checkpoint index rather than by training_iteration; the index starts at 0 and increments by 1 for each checkpoint (see the listing sketch after this list).
  • A checkpoint is forced on every perturbation interval for high-performing trials, which can make checkpointing more frequent.
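If it helps to see the new naming in practice, you can list a trial directory (the path below is a hypothetical placeholder, not a real experiment):

import os

# Hypothetical trial directory; replace with your actual storage path.
trial_dir = os.path.expanduser("~/ray_results/my_pbt_experiment/trial_00000")

# Folders are indexed checkpoint_000000, checkpoint_000001, ... regardless of
# the training_iteration at which each checkpoint was taken.
checkpoint_dirs = sorted(
    d for d in os.listdir(trial_dir) if d.startswith("checkpoint_")
)
print(checkpoint_dirs)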