Running ray.train.report(metrics=metrics, checkpoint=checkpoint) asynchronously to maximize GPU usage

  • High: It blocks me from completing my task.

Hello Ray team, my team and I are using Ray for training. The model we save is about 13 GB, and it takes around 20 minutes to upload to S3 storage; in the meantime the GPU workers sit idle.

To maximize GPU usage, we want to perform this upload in the background or asynchronously.
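Conceptually, what we mean by "in the background" is something along these lines: hand the checkpoint directory to a worker thread right after saving and let training continue while the upload runs. (This is just a plain-Python sketch using threading, not a Ray API; upload_in_background and the bucket path are illustrative names only.)

import threading

import s3fs


def upload_in_background(fs: s3fs.S3FileSystem, local_dir: str, s3_uri: str) -> threading.Thread:
    # Kick off a recursive directory upload on a daemon thread so the
    # training loop can keep running on the GPU while the copy proceeds.
    thread = threading.Thread(
        target=fs.put,
        args=(local_dir, s3_uri),
        kwargs={"recursive": True},
        daemon=True,
    )
    thread.start()
    return thread

# e.g. right after save_deepspeed_model(trainer, ckpt_path):
#   upload_thread = upload_in_background(s3_fs, ckpt_path, "my-bucket/checkpoints/step_1000")
#   ... keep training on the GPU while the upload proceeds ...
#   upload_thread.join()  # only block when the upload actually has to be finished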

What is the recommended Ray way to do this?

Below is a sample of our code:

import os

import pyarrow.fs
import s3fs

# Custom pyarrow filesystem backed by s3fs, used for checkpoint storage.
s3_fs = s3fs.S3FileSystem(
    key=os.getenv('AWS_ACCESS_KEY_ID'),
    secret=os.getenv('AWS_SECRET_ACCESS_KEY'),
    endpoint_url=endpoint,       # S3-compatible endpoint, defined elsewhere
    client_kwargs=region_dict,   # region settings, defined elsewhere
    max_concurrency=20,
)
custom_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3_fs))
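For completeness, this filesystem is handed to the trainer through RunConfig, roughly like this (the bucket path and scaling values below are placeholders, not our real settings):

from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),  # placeholder scaling
    run_config=RunConfig(
        storage_path="my-bucket/ray-checkpoints",  # placeholder bucket/prefix
        storage_filesystem=custom_fs,
    ),
)
result = trainer.fit()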

Inside the train_func:

        # Relies on module-level imports: os, time, ray.train, and
        # `from ray.train import Checkpoint`. ckpt_path and tmpdir are
        # set up earlier in train_func.
        time_start = time.time()
        save_deepspeed_model(trainer, ckpt_path)
        print(
            f"MIDASTOUCH: Files in the save path after custom save: {os.listdir(ckpt_path)}"
        )
        time_end = time.time()
        print(
            f"MIDASTOUCH: Time taken to save the model: {time_end - time_start} seconds"
        )

        # Report to the train session; this is where the ~20 min S3 upload
        # happens and the worker sits idle.
        checkpoint = Checkpoint.from_directory(tmpdir)
        print(
            "MIDASTOUCH: Reporting to train session / uploading the checkpoint to S3"
        )
        time_start = time.time()
        print(f"Before reporting: {checkpoint.get_metadata()}")
        ray.train.report(metrics=metrics, checkpoint=checkpoint)

        # Barrier to ensure all workers have finished reporting here.
        trainer.strategy.barrier()
        time_end = time.time()
        print(
            f"MIDASTOUCH: Time taken to report/upload the checkpoint: {time_end - time_start} seconds"
        )

Thank you!