Hi Team,
I am syncing logs from the worker nodes to HDFS with the Tune sync configuration shown below. I have noticed that only the head node's logs actually end up in HDFS. The worker nodes contain additional subfolders, such as rank_0, generated by the Hugging Face trainers; these hold custom checkpoints and logs written by the training code.
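For illustration, the rank-specific folders are created roughly like this inside the per-worker training function (a simplified sketch, not our exact code; paths and arguments are indicative only):

```python
import os

from ray.air import session
from transformers import TrainingArguments

# Each training worker writes its Hugging Face Trainer outputs (checkpoints,
# logs) into a rank-specific subfolder of its trial working directory.
rank = session.get_world_rank()
rank_dir = os.path.join(os.getcwd(), f"rank_{rank}")

training_args = TrainingArguments(
    output_dir=rank_dir,                         # custom checkpoints
    logging_dir=os.path.join(rank_dir, "logs"),  # training logs
)
```

The sync configuration itself looks like this: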
```python
from typing import Callable, Dict, List, Optional, Tuple

import pyarrow.fs

from ray import air, tune
from ray.air._internal.remote_storage import get_fs_and_path
from ray.tune.syncer import _DefaultSyncer


# Custom syncer: upload the local trial/experiment directory to HDFS with pyarrow.
class TuneSyncer(_DefaultSyncer):
    def _sync_up_command(
        self, local_path: str, uri: str, exclude: Optional[List] = None
    ) -> Tuple[Callable, Dict]:
        # Return the upload callable and its kwargs; Tune runs this in the background.
        return (
            self._upload_to_uri,
            dict(local_path=local_path, uri=uri, exclude=exclude),
        )

    def _upload_to_uri(
        self, local_path: str, uri: str, exclude: Optional[List[str]] = None
    ) -> None:
        # Resolve the HDFS filesystem from the URI and copy the local
        # directory into it (exclude is not applied here).
        fs, bucket_path = get_fs_and_path(uri)
        pyarrow.fs.copy_files(local_path, bucket_path, destination_filesystem=fs)


sync_config = tune.SyncConfig(upload_dir=hdfs_upload_dir, syncer=TuneSyncer())
run_config = air.RunConfig(
    sync_config=sync_config,
    local_dir=f'/home/jobuser/ray_results/{run_name}',
)
tuner = tune.Tuner(
    trainable=trainer,
    tune_config=tune_config,
    run_config=run_config,
    param_space=param_space,
)
```
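For completeness, the run is launched as below; the expectation was that each trial directory, including the worker-side rank_0 subfolders, would be uploaded under hdfs_upload_dir:

```python
# Launch the tuning run. Tune invokes TuneSyncer periodically and on
# checkpoints, so the expected HDFS layout was roughly
#   {hdfs_upload_dir}/{experiment_name}/{trial_name}/rank_0/...
# but in practice only files present on the head node appear in HDFS.
results = tuner.fit()
```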
Do we need to configure anything else to sync all the logs from the worker nodes directly to HDFS?
I want to avoid the overhead of copying these checkpoints from the worker nodes to the head node and then transferring them to HDFS. With the above configuration the rank_0 folders are not synced to the head node, which is actually required, since staging them there would add memory overhead on the head node.
I am looking for suggestions that sync these logs from the worker nodes directly to HDFS rather than routing them through the head node.
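For example, would reporting the rank-local output directory as an AIR checkpoint from each worker be the intended way to get these files uploaded directly, or would that still stage them on the head node first? A rough sketch of what I mean (the function and argument names are illustrative only):

```python
from ray.air import session
from ray.air.checkpoint import Checkpoint


def report_rank_dir(rank_dir: str, metrics: dict) -> None:
    # Called from inside the per-worker training loop: hand the rank-local
    # output directory to Ray as a checkpoint so that Ray handles uploading
    # it to the configured upload_dir.
    session.report(metrics, checkpoint=Checkpoint.from_directory(rank_dir))
```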
Thank you for your time.
Regards,
Vivek