Setting a CheckpointConfig doesn't seem to filter out checkpoints correctly

I’m running into a problem where specifying a CheckpointConfig doesn’t have the expected effect.

Here is some info about my setup and what I’m trying to do:

  • I’m using ray==2.8.1.
  • I’m running ray.tune experiments with 2 trials, each for 6 iterations (max_t=6).
  • When calling tune.run, I’m specifying the following checkpoint_config:
checkpoint_config = ray.air.config.CheckpointConfig(
    num_to_keep=1,
    checkpoint_score_attribute="eval_metric",  # validated in my code
    checkpoint_score_order="min",
)
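
For reference, here is a minimal sketch of how the rest of the run is wired up on my side. The trainable and metric values are simplified placeholders (not my actual pipeline), and I’ve written it with the Tuner API, which as far as I understand is equivalent to my tune.run call:

import os
import tempfile

from ray import train, tune

def my_trainable(config):
    # Placeholder training loop: 6 iterations, as in my real setup.
    for step in range(6):
        eval_metric = 1.0 / (step + 1)  # dummy value; mine comes from validation
        with tempfile.TemporaryDirectory() as tmpdir:
            # My real pipeline writes ~1 GB of model state here.
            with open(os.path.join(tmpdir, "model.pt"), "wb") as f:
                f.write(b"weights")
            # Report the metric together with a checkpoint so Tune tracks it.
            train.report(
                {"eval_metric": eval_metric},
                checkpoint=train.Checkpoint.from_directory(tmpdir),
            )

tuner = tune.Tuner(
    my_trainable,
    tune_config=tune.TuneConfig(num_samples=2),
    run_config=train.RunConfig(checkpoint_config=checkpoint_config),  # config from above
)
tuner.fit()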

When I run my tuning pipeline, I can see the ~/ray_results/ directory containing the following checkpoints for each of my trials:

[screenshot: trial directory listing showing several checkpoint_XXXXXX directories plus a checkpoints directory]

I’d like to keep only the contents of the checkpoint_XXXXXX directory containing the best checkpoint, and not the contents of the checkpoints directory, because I need to sync this to S3 using storage_path.

Storing all of those checkpoints takes quite a bit of space (each checkpoint is close to 1 GB). I thought I could achieve this with the CheckpointConfig object, but somehow it isn’t working properly.
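
For completeness, this is roughly how I point the run at S3 (the bucket path below is a placeholder):

run_config = train.RunConfig(
    storage_path="s3://my-bucket/ray-results",  # placeholder bucket/prefix
    checkpoint_config=checkpoint_config,  # the CheckpointConfig from above
)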

What am I missing? :thinking:


cc @justinvyu for Tune

I am also observing that num_to_keep is not working as expected.

I am using PyTorch and ray==2.9.1. To find a good config, I passed a train.RunConfig() with num_to_keep=2 to tune.Tuner(), but I can still see more than 2 checkpoints, and the total size keeps increasing.
I tried searching for a PyTorch example that uses num_to_keep=2 with Tune, but I couldn’t find one.
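
Here is my rough attempt at such a minimal example (the model, data, and learning rates are placeholders), in case someone can spot what I’m doing wrong:

import os
import tempfile

import torch
from ray import train, tune
from ray.train import Checkpoint, CheckpointConfig, RunConfig

def train_fn(config):
    # Placeholder model and data; my real setup is a larger PyTorch model.
    model = torch.nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    for epoch in range(5):
        x, y = torch.randn(8, 4), torch.randn(8, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with tempfile.TemporaryDirectory() as tmpdir:
            torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
            # Report the metric and checkpoint so num_to_keep can apply.
            train.report(
                {"loss": loss.item(), "epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )

tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.grid_search([0.01, 0.1])},
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(num_to_keep=2),
    ),
)
tuner.fit()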

Please help by either pointing out the correct approach or sharing an up-to-date usage example.

@Ruiyang_Wang

Any update on this?
This is concerning for us; we can’t rely on num_to_keep to actually limit the number of checkpoints kept.