Try to sync checkpoints to cloud: Sync only called once

eicnix · April 4, 2021, 7:29am

In my current setup I would like to upload any checkpoints during tune runs to an S3-compatible storage so I can serve the best models at a later point. To achieve this I created a sync config and a sync function.
When I start this setup, the sync function is called once at the start of the tune run but never after that. Currently I’m trying to get this to work on my local machine, but I would also like to run this in a Ray cluster.

    def sync_func(local, remote):
        import os
        import boto3
        s3 = boto3.resource('s3',
                            endpoint_url='...',
                            aws_access_key_id='...',
                            aws_secret_access_key='...')
        bucket = s3.Bucket("ray-tests")
        # Normalize Windows-style separators once, before building object keys.
        remote = remote.replace("\\", "/")

        for root, dirs, files in os.walk(local):
            # Only the last path component is kept as the remote subfolder.
            dir_name = os.path.basename(root)
            for file in files:
                path = os.path.join(root, file)
                # Compare with != rather than `is not`, which tests identity.
                remote_path = f"{dir_name}/{file}" if dir_name != "" else file
                print(f"Trying to upload file {remote}/{remote_path}")
                bucket.upload_file(path, remote + "/" + remote_path)


    sync_config = SyncConfig(
        sync_to_cloud=sync_func,
        sync_on_checkpoint=True,
        upload_dir="trials"
    )

    tune.run(
        ImpalaTrainer,
        config=config,
        checkpoint_at_end=True,
        checkpoint_freq=15,
        trial_dirname_creator=lambda x: x.trial_id,
        metric="episode_reward_mean",
        mode="max",
        sync_config=sync_config,
    )
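
A side note on the sync function in this post: `os.path.basename(root)` keeps only the last directory name, so files from nested folders may end up flattened under the remote prefix. A minimal sketch of an alternative key builder, assuming the goal is to mirror the local layout; `build_remote_key` is a hypothetical helper for illustration, not part of Ray Tune or boto3, and the paths are made up:

```python
import os


def build_remote_key(local_root, file_path, remote_prefix):
    """Map a local file to an object key that mirrors the local layout.

    Hypothetical helper for illustration; not part of Ray Tune or boto3.
    """
    # Path of the file relative to the directory being synced.
    rel_path = os.path.relpath(file_path, local_root)
    # Object keys use forward slashes regardless of the local OS.
    rel_key = rel_path.replace(os.sep, "/")
    prefix = remote_prefix.replace("\\", "/").strip("/")
    return f"{prefix}/{rel_key}" if prefix else rel_key


key = build_remote_key(
    "/tmp/trial_1",
    "/tmp/trial_1/checkpoint_15/checkpoint-15",
    "trials/trial_1",
)
print(key)
```

Inside the `os.walk` loop, the result could be passed as the second argument to `bucket.upload_file(path, ...)`. This does not address the sync only being called once; it only changes how the object keys are built.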

yic · April 5, 2021, 5:57am

@rliaw could you take a look at this?

rliaw · April 5, 2021, 7:58am

@eicnix could you try increasing the frequency of the syncing? I think right now it syncs every 5 minutes.

eicnix · April 5, 2021, 4:34pm

@rliaw Thank you for the suggestion. I set the sync times to 10s:

    sync_config = SyncConfig(
        sync_to_cloud=sync_func,
        sync_on_checkpoint=True,
        sync_to_driver=True,
        cloud_sync_period=10,
        node_sync_period=10,
        upload_dir="trials"
    )

This didn’t change the behaviour at all. I ran three trials with the posted sync config and the sync function was only called once at the beginning. Is the sync function supposed to be a continuous background process?
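
To make it easier to confirm how often the sync function is actually invoked, one option is to wrap it in a small call logger. This is only a sketch: `log_sync_calls` and `dummy_sync` are hypothetical names, and `dummy_sync` stands in for the real upload logic.

```python
import functools
import time


def log_sync_calls(sync_fn, calls):
    """Wrap a sync function so each invocation is recorded in `calls`."""

    @functools.wraps(sync_fn)
    def wrapper(local, remote):
        # Record when the call happened and which paths were passed.
        calls.append((time.time(), local, remote))
        print(f"sync call #{len(calls)}: {local} -> {remote}")
        return sync_fn(local, remote)

    return wrapper


def dummy_sync(local, remote):
    # Stand-in for the real upload logic (hypothetical).
    return True


sync_calls = []
logged_sync = log_sync_calls(dummy_sync, sync_calls)
logged_sync("/tmp/trial_1", "trials/trial_1")
logged_sync("/tmp/trial_1", "trials/trial_1")
```

The wrapped function could then be passed as `sync_to_cloud` in place of `sync_func`. If the sync function ends up running in a different process than the driver, the in-memory list will not be shared, so the printed lines are probably the more reliable signal.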

rliaw · April 5, 2021, 9:35pm

Hmm, it’s supposed to happen once every couple of seconds (per the period).

I know this is a big ask, but could you post an issue on GitHub with a simple script for reproduction? That’d help me quickly diagnose the issue, and I could probably get to it by the end of the week.

eicnix · April 6, 2021, 4:54am

Thank you for your support. I’m also very keen to have this issue fixed. You can find the GitHub issue with the reproduction script here: [tune] Sync function is only called once · Issue #15129 · ray-project/ray · GitHub
