Ray Tune SyncConfig not syncing logs from worker nodes

Hi Team,
I am using the Tune sync configuration below to sync logs from the worker nodes to HDFS.

I have noticed that the logs synced to HDFS come only from the head node. The worker nodes have additional subfolders, such as rank_0, generated by the Hugging Face trainers, which contain custom checkpoints and logs from the training code.
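
For reference, a trial directory on a worker looks roughly like this (names are illustrative):

/home/jobuser/ray_results/<run_name>/<trial_name>/
    rank_0/              # created by the Hugging Face trainer on this worker
        checkpoints/     # custom checkpoints from the training code
        logs/            # training logs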

from typing import Callable, Dict, List, Optional, Tuple

import pyarrow.fs
from ray import air, tune
from ray.air._internal.remote_storage import get_fs_and_path
from ray.tune.syncer import _DefaultSyncer

class TuneSyncer(_DefaultSyncer):
    def _sync_up_command(
        self, local_path: str, uri: str, exclude: Optional[List] = None
    ) -> Tuple[Callable, Dict]:
        return (
            self._upload_to_uri,
            dict(local_path=local_path, uri=uri, exclude=exclude),
        )

    def _upload_to_uri(
        self, local_path: str, uri: str, exclude: Optional[List[str]] = None
    ) -> None:
        # Resolve the filesystem (HDFS) and target path from the URI,
        # then copy the local results directory up.
        fs, bucket_path = get_fs_and_path(uri)
        pyarrow.fs.copy_files(local_path, bucket_path, destination_filesystem=fs)

sync_config = tune.SyncConfig(upload_dir=hdfs_upload_dir, syncer=TuneSyncer())
run_config = air.RunConfig(
    sync_config=sync_config,
    local_dir=f"/home/jobuser/ray_results/{run_name}",
)
tuner = tune.Tuner(
    trainable=trainer,
    tune_config=tune_config,
    run_config=run_config,
    param_space=param_space,
)
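
The Tuner is then run with tuner.fit() (call omitted above for brevity).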


Do we need to configure anything else to sync all the logs from the worker nodes directly to HDFS?
I do not want the overhead of copying these checkpoints from the worker nodes to the head node and then transferring them to HDFS. With the above configuration, the rank_0 folder is not synced to the head node (which is required to avoid memory overhead on the head node).

I am looking for suggestions that sync the logs from the worker nodes directly to HDFS, without copying them to the head node first.
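
To be concrete, the fallback I could hand-roll is an explicit per-worker upload from inside the training code, along these lines (a rough sketch only; the HDFS host/port and the upload_worker_artifacts helper name are placeholders for our setup):

import pyarrow.fs

def upload_worker_artifacts(local_dir: str, hdfs_path: str) -> None:
    # Hypothetical helper: copy this worker's artifacts (e.g. its rank_0
    # folder) straight to HDFS, bypassing the head node entirely.
    # "namenode" and 8020 are placeholders for the cluster's HDFS endpoint.
    hdfs = pyarrow.fs.HadoopFileSystem(host="namenode", port=8020)
    pyarrow.fs.copy_files(local_dir, hdfs_path, destination_filesystem=hdfs)

# e.g., called at the end of training on each worker (paths are placeholders):
# upload_worker_artifacts(local_rank_dir, "/path/on/hdfs/rank_0")

I would prefer the syncer to handle this rather than maintaining such a hook in every training script.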

Thank you for your time.

Regards,
Vivek

Yeah, I believe that is not fully supported yet. I filed [air] Artifact syncing doesn't work for ddp workers · Issue #34475 · ray-project/ray · GitHub to track this.
