How to use an s3fs filesystem to save checkpoints in Ray Train

I want to train a simple XGBoost model on a multi-node Ray cluster and save the model checkpoints to my Ceph S3 bucket (it's not Amazon). To do so, I need to specify an access_key_id, secret_access_key, endpoint_url, and bucket_name. According to the documentation, "You can add more filesystems by installing fsspec-compatible filesystems, e.g. using pip." However, I don't know where to specify the s3fs filesystem: RunConfig and CheckpointConfig do not take an argument named filesystem.
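
For reference, this is roughly the filesystem object I would like Ray Train to write checkpoints through (a minimal sketch with s3fs; the credentials, endpoint, and bucket name are placeholders):

import s3fs

# Placeholder Ceph credentials and endpoint -- not real values
fs = s3fs.S3FileSystem(
    key="MY_ACCESS_KEY_ID",
    secret="MY_SECRET_ACCESS_KEY",
    client_kwargs={"endpoint_url": "https://my-ceph-gateway:8080"},
)

# Plain I/O against the bucket works, but where do I hand this to Ray Train?
print(fs.ls("my-bucket"))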

@milad_heidari Thanks for the question. We mention this and provide a code snippet showing how to do it with RunConfig in this release blog.

You specify the cloud storage location in the `storage_path` keyword argument of `RunConfig`.

# Imports are a sketch and may vary by Ray version; in recent releases
# these live under ray.train / ray.train.huggingface.
from ray.train import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.huggingface import TransformersTrainer

trainer = TransformersTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=ScalingConfig(num_workers=4),
    run_config=RunConfig(
        # Requirement: Use cloud storage
        # Your checkpoints will be found within "s3://your-s3-bucket/example"
        storage_path="s3://your-s3-bucket",
        name="example",
        checkpoint_config=CheckpointConfig(
            _checkpoint_keep_all_ranks=True,
            _checkpoint_upload_from_workers=True,
        ),
    ),
)
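
Credentials themselves are not passed through RunConfig; the S3 filesystem underneath (pyarrow/fsspec) typically picks them up from the standard AWS environment variables on each node of the cluster. A minimal sketch with placeholder values:

import os

# Placeholder credentials -- in practice set these on every node,
# e.g. via your cluster launcher or runtime environment.
os.environ["AWS_ACCESS_KEY_ID"] = "MY_ACCESS_KEY_ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "MY_SECRET_ACCESS_KEY"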

Thanks for the response. The problem is that I'm using S3 from Ceph (not AWS), so I need to provide an endpoint_url as well as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, but Ray doesn't understand the endpoint URL when it's given as an environment variable. How do I address this issue?
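
For context, this is the kind of filesystem object I would like to hand to Ray Train explicitly (a sketch with pyarrow's S3FileSystem; the endpoint and credentials are placeholders, and I haven't found where such an object can be plugged into RunConfig in my Ray version):

from pyarrow import fs

# Placeholder Ceph endpoint and credentials -- not real values
ceph_s3 = fs.S3FileSystem(
    access_key="MY_ACCESS_KEY_ID",
    secret_key="MY_SECRET_ACCESS_KEY",
    endpoint_override="https://my-ceph-gateway:8080",
)

# Sanity check that the bucket is reachable through the custom endpoint
print(ceph_s3.get_file_info(fs.FileSelector("my-bucket")))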