How to enable debug logs from Ray train's internal checkpoint manager?

Ray is deleting checkpoints from S3 for some reason, and I need to know why.

Ray deletes old checkpoints from S3 to enforce the retention policy set by the num_to_keep parameter in CheckpointConfig. When num_to_keep is set (e.g., to 1 or 2), Ray keeps only the top K checkpoints (by recency, or by the metric named in checkpoint_score_attribute) and deletes the rest to save storage. This deletion happens both locally and in cloud storage such as S3. To retain all checkpoints, set num_to_keep=None (the default); otherwise, Ray automatically removes older checkpoints as new ones are created and uploaded to S3. This is intended behavior, not a bug: it manages storage usage according to your configuration. See the Ray documentation, the checkpoint manager test code, and GitHub issue #21469 for examples and discussion.
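The retention rule described above can be sketched in a few lines of plain Python. This mirrors the policy (keep the top K by score, delete the rest; None keeps everything), not Ray's actual checkpoint-manager code; the names here are illustrative:

```python
import heapq

def checkpoints_to_delete(checkpoints, num_to_keep):
    """Top-K retention: keep the num_to_keep best checkpoints by score
    and return the rest as deletion candidates.
    num_to_keep=None means keep everything."""
    if num_to_keep is None:
        return []
    keep = heapq.nlargest(num_to_keep, checkpoints, key=lambda c: c["score"])
    keep_ids = {c["id"] for c in keep}
    return [c for c in checkpoints if c["id"] not in keep_ids]

ckpts = [{"id": i, "score": s} for i, s in enumerate([0.1, 0.5, 0.3])]
print([c["id"] for c in checkpoints_to_delete(ckpts, 1)])  # -> [0, 2]
print(checkpoints_to_delete(ckpts, None))                  # -> []
```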

Would you like more detail on how to configure checkpoint retention or how the deletion process works?

Sources:

Hint: Mention @RunLLM in the post for followups.

@RunLLM I don’t think it’s that


import os

import ray.train
from ray.train import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={"hidden_size": 1024, "batch_size": 64, "lr": 1e-3, "max_steps": 100},
    scaling_config=ScalingConfig(num_workers=int(os.environ.get("NUM_WORKERS", 4)), use_gpu=True),
    torch_config=TorchConfig(backend="cpu:gloo,cuda:nccl"),
    run_config=RunConfig(
        name=EXPERIMENT_NAME,
        failure_config=ray.train.FailureConfig(max_failures=0),
        storage_path=S3_PATH,
        checkpoint_config=CheckpointConfig(num_to_keep=None),
    ),
)

What else could cause checkpoint deletions?

If you have set num_to_keep=None in CheckpointConfig, Ray should not delete any checkpoints. Other causes to rule out:

1. Manual deletion of checkpoints in your training code or scripts.
2. External processes or scripts modifying the S3 bucket.
3. Storage cleanup policies (e.g., lifecycle rules) configured on the S3 bucket itself.
4. Trial resumption or experiment restarts that clean up old checkpoints when the experiment directory is reused.

There is no evidence in the provided data sources that Ray deletes checkpoints when num_to_keep=None, unless one of these external factors is involved or the experiment name is not unique and a new run overwrites the old directory (Ray documentation, test_checkpoint_manager.py).
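To see the deletions from the S3 side, enabling S3 server access logging or CloudTrail data events on the bucket records which principal issued the DeleteObject calls. For a local or synced copy of the checkpoint directory, a small stdlib helper (illustrative, not a Ray API) can pinpoint what disappears between snapshots:

```python
import tempfile
from pathlib import Path

def snapshot(root):
    """Record every path currently under the checkpoint root."""
    root = Path(root)
    return {str(p.relative_to(root)) for p in root.rglob("*")}

def diff_deleted(before, after):
    """Paths present in the earlier snapshot but missing from the later one."""
    return sorted(set(before) - set(after))

# Demo against a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "checkpoint_000001").mkdir()
    (Path(d) / "checkpoint_000002").mkdir()
    before = snapshot(d)
    (Path(d) / "checkpoint_000001").rmdir()
    after = snapshot(d)
    print(diff_deleted(before, after))  # -> ['checkpoint_000001']
```

Running snapshot() on a schedule (e.g., from a cron job) gives you a timestamp for each deletion, which you can then correlate with trainer events or external processes.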

Would you like more detail on how to diagnose which of these factors might be affecting your setup?
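As for the original question in the title: Ray uses Python's standard logging module, so one way to surface the checkpoint manager's debug messages is to raise the level on the "ray.train" logger, which its submodules log under (the exact logger name for the internal checkpoint manager may vary by Ray version, so verify it against your installation). A minimal sketch:

```python
import logging

# Ray relies on Python's standard logging; setting DEBUG on the
# "ray.train" logger propagates to its submodules (assumed to include
# the internal checkpoint manager -- check the name for your version).
logging.basicConfig(level=logging.INFO)
logging.getLogger("ray.train").setLevel(logging.DEBUG)

# Worker processes configure logging independently, so repeat the
# setLevel call inside train_loop_per_worker if the messages you need
# originate on the workers rather than the driver.
```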
