How to enable debug logs from Ray train's internal checkpoint manager?

Ray is deleting checkpoints from S3 for some reason, and I need to know why.

Ray deletes old checkpoints from S3 to enforce the retention policy set by the num_to_keep parameter in CheckpointConfig. When num_to_keep is set (e.g., to 1 or 2), Ray keeps only the top K checkpoints (by recency, or by the metric named in checkpoint_score_attribute) and deletes the rest to save storage. This deletion happens both locally and in cloud storage such as S3. To retain all checkpoints, set num_to_keep=None (the default); otherwise, Ray automatically removes older checkpoints as new ones are created and uploaded to S3. This is intended behavior, not a bug: it manages storage usage according to your configuration. See the Ray documentation, the checkpoint manager test code, and GitHub issue #21469 for examples and discussion.
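The retention rule described above can be sketched in a few lines of plain Python. This mirrors the policy (keep the top K by score, delete the rest; None keeps everything), not Ray's actual checkpoint-manager code; the names here are illustrative:

```python
import heapq

def checkpoints_to_delete(checkpoints, num_to_keep):
    """Top-K retention: keep the num_to_keep best checkpoints by score
    and return the rest as deletion candidates.
    num_to_keep=None means keep everything."""
    if num_to_keep is None:
        return []
    keep = heapq.nlargest(num_to_keep, checkpoints, key=lambda c: c["score"])
    keep_ids = {c["id"] for c in keep}
    return [c for c in checkpoints if c["id"] not in keep_ids]

ckpts = [{"id": i, "score": s} for i, s in enumerate([0.1, 0.5, 0.3])]
print([c["id"] for c in checkpoints_to_delete(ckpts, 1)])  # -> [0, 2]
print(checkpoints_to_delete(ckpts, None))                  # -> []
```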

Would you like more detail on how to configure checkpoint retention or how the deletion process works?

Sources:

Hint: Mention @RunLLM in the post for followups.

@RunLLM I don’t think it’s that


import os

import ray.train
from ray.train import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={"hidden_size": 1024, "batch_size": 64, "lr": 1e-3, "max_steps": 100},
    scaling_config=ScalingConfig(num_workers=int(os.environ.get("NUM_WORKERS", 4)), use_gpu=True),
    torch_config=TorchConfig(backend="cpu:gloo,cuda:nccl"),
    run_config=RunConfig(
        name=EXPERIMENT_NAME,
        failure_config=ray.train.FailureConfig(max_failures=0),
        storage_path=S3_PATH,
        checkpoint_config=CheckpointConfig(num_to_keep=None),
    ),
)

What else could cause checkpoint deletions?

If you have set num_to_keep=None in CheckpointConfig, Ray should not delete any checkpoints. Other causes to rule out:

1. Manual deletion of checkpoints in your training code or scripts.
2. External processes or scripts modifying the S3 bucket.
3. Storage cleanup policies (e.g., lifecycle rules) configured on the S3 bucket itself.
4. Trial resumption or experiment restarts that clean up old checkpoints when the experiment directory is reused.

There is no evidence in the provided data sources that Ray deletes checkpoints when num_to_keep=None, unless one of these external factors is involved or the experiment name is not unique and a new run overwrites the old directory (Ray documentation, test_checkpoint_manager.py).
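To see the deletions from the S3 side, enabling S3 server access logging or CloudTrail data events on the bucket records which principal issued the DeleteObject calls. For a local or synced copy of the checkpoint directory, a small stdlib helper (illustrative, not a Ray API) can pinpoint what disappears between snapshots:

```python
import tempfile
from pathlib import Path

def snapshot(root):
    """Record every path currently under the checkpoint root."""
    root = Path(root)
    return {str(p.relative_to(root)) for p in root.rglob("*")}

def diff_deleted(before, after):
    """Paths present in the earlier snapshot but missing from the later one."""
    return sorted(set(before) - set(after))

# Demo against a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "checkpoint_000001").mkdir()
    (Path(d) / "checkpoint_000002").mkdir()
    before = snapshot(d)
    (Path(d) / "checkpoint_000001").rmdir()
    after = snapshot(d)
    print(diff_deleted(before, after))  # -> ['checkpoint_000001']
```

Running snapshot() on a schedule (e.g., from a cron job) gives you a timestamp for each deletion, which you can then correlate with trainer events or external processes.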

Would you like more detail on how to diagnose which of these factors might be affecting your setup?
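As for the original question in the title: Ray uses Python's standard logging module, so one way to surface the checkpoint manager's debug messages is to raise the level on the "ray.train" logger, which its submodules log under (the exact logger name for the internal checkpoint manager may vary by Ray version, so verify it against your installation). A minimal sketch:

```python
import logging

# Ray relies on Python's standard logging; setting DEBUG on the
# "ray.train" logger propagates to its submodules (assumed to include
# the internal checkpoint manager -- check the name for your version).
logging.basicConfig(level=logging.INFO)
logging.getLogger("ray.train").setLevel(logging.DEBUG)

# Worker processes configure logging independently, so repeat the
# setLevel call inside train_loop_per_worker if the messages you need
# originate on the workers rather than the driver.
```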
