WARNING:syncer.py:505 -- Last Sync command failed

Error message:
WARNING:syncer.py:505 -- Last Sync command failed: Sync process failed: [Errno 2] Failed to open local file '<hyp_config>_train_parquet_file_1.c000' Detail:[errno 2] No such file or directory
WARNING tune.py919 -- Trial Runner checkpoting failed :Sync process failed [Errno 2] Failed to open local file '<hyp_config>_valid_parquet_file_22.c000' Detail: [errno 2] No such file or directory

I’m using Ray Tune with TensorFlow to train a deep learning model. All the training data is in S3. My workflow is as follows: download a parquet file from S3, then create a tf.data.Dataset from a generator that yields the data in that parquet file. Since downloading all the parquet files before creating the generator is not possible due to disk constraints, I download one parquet file at a time, yield its data, and delete it, and this repeats until every training parquet file has been read. So each time I’m downloading ‘train_parquet_file_{i}.c000’ and then deleting it.

All logs and HPO results seem to be uploaded to S3 just fine. The model is also training well; if the parquet files had not been downloaded to the local machine, training wouldn’t have succeeded. So I’m guessing this WARNING comes from the syncer trying to sync a file that I have already deleted. What exactly causes this WARNING, and how can I avoid it?

Hi @Haneul_Kim,

Yes, it looks like the Tune driver process (which runs the control loop) tries to upload a file from the head node after that file has already been deleted by a worker process.

This is a known shortcoming of the current “syncer-based” approach, and it will be fixed in the upcoming Ray 2.7.

As a workaround, you could download the parquet files to a directory outside the ~/ray_results folder, for example somewhere under /tmp, so that the syncer never sees them.
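
A minimal sketch of that workaround, under some assumptions: the helper name `iter_files_via_scratch_dir` and its `download` and `read_rows` callables are hypothetical and not part of Ray or TensorFlow. It keeps the one-file-at-a-time pattern from the question but places each file in a scratch directory created with `tempfile.mkdtemp` (the system temp directory, typically /tmp on Linux), so nothing is written under ~/ray_results.

```python
import os
import shutil
import tempfile
from typing import Callable, Iterable, Iterator, Optional


def iter_files_via_scratch_dir(
    keys: Iterable[str],
    download: Callable[[str, str], None],   # hypothetical: (remote_key, local_path)
    read_rows: Callable[[str], Iterable],   # hypothetical: local_path -> rows
    scratch_root: Optional[str] = None,     # None means the system temp dir
) -> Iterator:
    """Download one file at a time, yield its rows, then delete it."""
    # The scratch dir lives outside ~/ray_results, so files that appear and
    # disappear here are not inside the folder mentioned in the reply above.
    scratch_dir = tempfile.mkdtemp(prefix="parquet_scratch_", dir=scratch_root)
    try:
        for key in keys:
            local_path = os.path.join(scratch_dir, os.path.basename(key))
            download(key, local_path)
            try:
                yield from read_rows(local_path)
            finally:
                # Free disk space before fetching the next file.
                if os.path.exists(local_path):
                    os.remove(local_path)
    finally:
        # Remove the scratch dir itself once the generator is exhausted or closed.
        shutil.rmtree(scratch_dir, ignore_errors=True)
```

In an actual trial, `download` would presumably wrap an S3 client call and `read_rows` a parquet reader, and the resulting generator could be passed to `tf.data.Dataset.from_generator`; those pieces are left out here because they depend on your setup.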