Setting a CheckpointConfig doesn't seem to filter out checkpoints correctly

I’m running into a problem where specifying a CheckpointConfig doesn’t have the expected effect.

Here is some info about my setup and what I’m trying to do:

  • I’m using ray==2.8.1.
  • I’m running ray.tune experiments with 2 trials, each for 6 iterations (max_t=6).
  • When calling tune.run, I’m specifying the following checkpoint_config:
checkpoint_config = ray.air.config.CheckpointConfig(
    num_to_keep=1,
    checkpoint_score_attribute="eval_metric",  # validated in my code
    checkpoint_score_order="min",
)
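
For reference, here is a minimal sketch of how the rest of the run is wired up on my side. The trainable and metric values are simplified placeholders (not my actual pipeline), and I’ve written it with the Tuner API, which as far as I understand is equivalent to my tune.run call:

import os
import tempfile

from ray import train, tune

def my_trainable(config):
    # Placeholder training loop: 6 iterations, as in my real setup.
    for step in range(6):
        eval_metric = 1.0 / (step + 1)  # dummy value; mine comes from validation
        with tempfile.TemporaryDirectory() as tmpdir:
            # My real pipeline writes ~1 GB of model state here.
            with open(os.path.join(tmpdir, "model.pt"), "wb") as f:
                f.write(b"weights")
            # Report the metric together with a checkpoint so Tune tracks it.
            train.report(
                {"eval_metric": eval_metric},
                checkpoint=train.Checkpoint.from_directory(tmpdir),
            )

tuner = tune.Tuner(
    my_trainable,
    tune_config=tune.TuneConfig(num_samples=2),
    run_config=train.RunConfig(checkpoint_config=checkpoint_config),  # config from above
)
tuner.fit()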

When I run my tuning pipeline, I can see the ~/ray_results/ directory containing the following checkpoints for each of my trials:

[screenshot: trial directory listing showing several checkpoint_XXXXXX directories plus a checkpoints directory]

I’d like to keep only the contents of the checkpoint_XXXXXX directory containing the best checkpoint, and not the contents of the checkpoints directory, because I need to sync this to S3 using storage_path.

Storing all of those checkpoints takes quite a bit of space (each checkpoint is close to 1 GB). I thought I could achieve this with the CheckpointConfig object, but somehow it isn’t working properly.
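
For completeness, this is roughly how I point the run at S3 (the bucket path below is a placeholder):

run_config = train.RunConfig(
    storage_path="s3://my-bucket/ray-results",  # placeholder bucket/prefix
    checkpoint_config=checkpoint_config,  # the CheckpointConfig from above
)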

What am I missing? :thinking:


cc @justinvyu for Tune

I am also observing that num_to_keep is not working as expected.

I am using PyTorch and ray==2.9.1. To find a good config, I passed a train.RunConfig() with num_to_keep=2 to tune.Tuner(), but I can still see more than 2 checkpoints, and the total size keeps increasing.
I tried searching for a PyTorch example that uses num_to_keep=2 with Tune, but I couldn’t find one.
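
Here is my rough attempt at such a minimal example (the model, data, and learning rates are placeholders), in case someone can spot what I’m doing wrong:

import os
import tempfile

import torch
from ray import train, tune
from ray.train import Checkpoint, CheckpointConfig, RunConfig

def train_fn(config):
    # Placeholder model and data; my real setup is a larger PyTorch model.
    model = torch.nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    for epoch in range(5):
        x, y = torch.randn(8, 4), torch.randn(8, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with tempfile.TemporaryDirectory() as tmpdir:
            torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
            # Report the metric and checkpoint so num_to_keep can apply.
            train.report(
                {"loss": loss.item(), "epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )

tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.grid_search([0.01, 0.1])},
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(num_to_keep=2),
    ),
)
tuner.fit()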

Please help by either pointing out the correct approach or sharing an up-to-date usage example.

@Ruiyang_Wang

Any update on this?
This is concerning for us; we can’t rely on num_to_keep to actually limit the number of checkpoints kept.