What are the advantages of calling `ray.train.report` with upload_mode as `NO_UPLOAD`?

I’ve been reading about the checkpointing capabilities of Ray Train (Saving and Loading Checkpoints — Ray 2.53.0), especially custom checkpointing. I understand how `ray.train.report` with the ASYNC upload mode can improve training goodput.

However, my training code already has a custom async checkpointing mechanism that I’m planning to keep. The Ray docs say:

> If you want to upload the checkpoint synchronously, you can first upload the checkpoint to the storage_path and then report a reference to the uploaded checkpoint with ray.train.CheckpointUploadMode.NO_UPLOAD.

If I’m managing uploads to storage path manually, is there any benefit to calling ray.train.report?

If you are manually uploading checkpoints to the storage path, calling ray.train.report with checkpoint_upload_mode=ray.train.CheckpointUploadMode.NO_UPLOAD is still beneficial. The call registers the checkpoint with Ray Train, which lets Ray track checkpoint metadata, select the best checkpoint, and tie the checkpoint into its result-reporting and fault-tolerance mechanisms. Without reporting, Ray Train is not aware of your checkpoints, and features like auto-resume, experiment analysis, and checkpoint management will not work as intended.

According to the Ray Train checkpointing guide, you should upload the checkpoint yourself, then call ray.train.report(metrics, checkpoint=Checkpoint(path), checkpoint_upload_mode=ray.train.CheckpointUploadMode.NO_UPLOAD) to inform Ray Train of the new checkpoint. This ensures your custom checkpoint is included in Ray’s experiment tracking and recovery features.
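As a rough illustration of that pattern (not taken verbatim from the docs), here is a minimal sketch. The `upload_to_storage` helper and the `/mnt/shared/checkpoints` destination are hypothetical stand-ins for your own upload mechanism and storage location:

```python
import os
import shutil
import tempfile

import ray.train
from ray.train import Checkpoint, CheckpointUploadMode


def upload_to_storage(local_dir: str, step: int) -> str:
    # Hypothetical stand-in for your own (possibly async) upload logic;
    # here it simply copies the checkpoint to a shared directory.
    dest = os.path.join("/mnt/shared/checkpoints", f"ckpt-{step:04d}")
    shutil.copytree(local_dir, dest, dirs_exist_ok=True)
    return dest


def train_func(config):
    for step in range(config["num_steps"]):
        loss = 1.0 / (step + 1)  # placeholder metric

        with tempfile.TemporaryDirectory() as tmp_dir:
            # Write checkpoint files however your framework does it.
            with open(os.path.join(tmp_dir, "model.pt"), "wb") as f:
                f.write(b"fake-weights")

            # Upload with your own mechanism; Ray does not touch the files.
            ckpt_path = upload_to_storage(tmp_dir, step)

            # Tell Ray Train where the already-uploaded checkpoint lives.
            ray.train.report(
                metrics={"loss": loss},
                checkpoint=Checkpoint(ckpt_path),
                checkpoint_upload_mode=CheckpointUploadMode.NO_UPLOAD,
            )
```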

Would you like more detail or code examples?


@RunLLM

How would it help with fault tolerance? More specifically, how would Ray know to download my checkpoint from my custom source to resume training after a fault?

Ray Train’s fault tolerance relies on the latest reported checkpoint to resume training after a failure. When you call ray.train.report with a Checkpoint object (even with NO_UPLOAD), Ray records the checkpoint’s path and metadata. On recovery, Ray will use this path to locate and download the checkpoint for resumption, as long as the path is accessible to all workers and the storage location is correctly configured in your RunConfig (docs, fault tolerance guide).

If your checkpoint is stored in a custom location (e.g., S3, NFS), ensure the path you provide to Checkpoint is valid and reachable by all nodes. Ray will then fetch the checkpoint from this path during recovery, enabling seamless resumption of training from the last reported state.
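To make the recovery path concrete, here is a hedged sketch of the restore side. The `load_my_state` helper and the `RunConfig` values are placeholders; `ray.train.get_checkpoint`, `Checkpoint.as_directory`, and `FailureConfig` are standard Ray Train APIs:

```python
import ray.train
from ray.train import FailureConfig, RunConfig


def train_func(config):
    start_step = 0
    # Returns the most recently reported checkpoint (if any), e.g. after
    # a worker failure or when restoring a previous run.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        # Downloads the checkpoint contents to a local temp directory
        # on this worker, regardless of where they were uploaded.
        with checkpoint.as_directory() as ckpt_dir:
            start_step = load_my_state(ckpt_dir)  # hypothetical restore helper
    # ... continue training from start_step ...


# Let Ray retry the run automatically on worker/node failures.
run_config = RunConfig(
    storage_path="s3://my-bucket/ray-results",  # placeholder
    failure_config=FailureConfig(max_failures=3),
)
```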

Would you like a step-by-step example of this process?


@RunLLM Assume my custom checkpoint save logic writes to an object store called “XYZtorage” that is neither a local file system nor an object store like AWS S3. I’m assuming I will need to write some custom download logic to pull the checkpoint down to the node. Give me the code for configuring this.

Ray Train expects checkpoints to be accessible through a filesystem interface, typically a pyarrow.fs.FileSystem. If “XYZtorage” is not supported out of the box, you need to supply a custom pyarrow filesystem for both upload and download, and pass it via RunConfig(storage_filesystem=...) (persistent storage guide). Note that pyarrow.fs.FileSystem cannot be subclassed directly from Python; the supported way to write a custom filesystem is to implement a pyarrow.fs.FileSystemHandler and wrap it in pyarrow.fs.PyFileSystem.

Here’s a template for configuring Ray Train with such a custom filesystem; the handler methods are stubs you would fill in with calls to XYZtorage:

```python
import pyarrow.fs
from ray import train
from ray.train.torch import TorchTrainer


# pyarrow.fs.FileSystem cannot be subclassed from Python directly.
# Implement a FileSystemHandler instead and wrap it in PyFileSystem.
class XYZtorageHandler(pyarrow.fs.FileSystemHandler):
    def __init__(self, client):
        # `client` is a placeholder for your XYZtorage connection object.
        super().__init__()
        self.client = client

    def get_type_name(self):
        return "xyztorage"

    def get_file_info(self, paths):
        # Return a list of pyarrow.fs.FileInfo objects, one per path.
        raise NotImplementedError

    def get_file_info_selector(self, selector):
        # List directory contents for a pyarrow.fs.FileSelector.
        raise NotImplementedError

    def create_dir(self, path, recursive):
        # Create a directory (and parents, if recursive is True).
        raise NotImplementedError

    def open_input_file(self, path):
        # Return a readable, seekable file object (used for downloads).
        raise NotImplementedError

    def open_input_stream(self, path):
        # Return a readable stream (used for downloads).
        raise NotImplementedError

    def open_output_stream(self, path, metadata):
        # Return a writable stream (used for uploads).
        raise NotImplementedError

    # Also implement the remaining abstract methods: delete_dir,
    # delete_dir_contents, delete_root_dir_contents, delete_file, move,
    # copy_file, open_append_stream, and normalize_path.


fs = pyarrow.fs.PyFileSystem(XYZtorageHandler(client=...))

trainer = TorchTrainer(
    ...,  # train_loop_per_worker, scaling_config, etc.
    run_config=train.RunConfig(
        storage_filesystem=fs,
        storage_path="your_xyz_storage_path",
        name="experiment_name",
    ),
)
```

You must implement all abstract methods of FileSystemHandler; Ray Train will then use this filesystem, through pyarrow, to write checkpoint files to XYZtorage and to download them again when a checkpoint needs to be restored.
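For completeness, a hedged usage sketch: once you have the wrapped filesystem, you can also attach it to the Checkpoint you report with NO_UPLOAD so that Ray knows how to read the files back during recovery. The path below is a placeholder:

```python
import ray.train
from ray.train import Checkpoint, CheckpointUploadMode

# `fs` is the PyFileSystem wrapper from the previous snippet.
# Inside your training function, after your own code has written the
# checkpoint to XYZtorage at this (placeholder) path:
ray.train.report(
    metrics={"loss": 0.12},
    checkpoint=Checkpoint(path="bucket/run1/ckpt-0007", filesystem=fs),
    checkpoint_upload_mode=CheckpointUploadMode.NO_UPLOAD,
)

# On restore, Checkpoint.as_directory() uses `fs` to download the files
# into a local temp directory on the worker.
```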

Would you like more detail on implementing a custom pyarrow.fs.FileSystem?
