How severely does this issue affect your experience of using Ray?
- Medium: It causes significant difficulty in completing my task, but I can work around it.
I run my job on an AWS cluster. The job creates multiple Parquet files on each worker (~1 GB each, 100-500 GB in total).
What are the best practices for retrieving these files to my local machine (outside the cluster)?
Current solution: each remote function returns the contents of its file, and I then fetch those contents from the object store to my local machine. This does not scale and leads to OOM errors.
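Roughly, my current approach looks like this (simplified; the paths are placeholders):

```python
import os
import ray

@ray.remote
def read_file(path: str) -> bytes:
    # Returns the whole ~1 GB file through the object store.
    with open(path, "rb") as f:
        return f.read()

paths = ["/tmp/part-0000.parquet"]  # placeholder: the files on the workers
refs = [read_file.remote(p) for p in paths]
for path, ref in zip(paths, refs):
    # ray.get() materializes each full file in driver memory -> OOM at scale.
    with open(os.path.basename(path), "wb") as out:
        out.write(ray.get(ref))
```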
I ruled out rsync because:
- It's unclear how to sync from all workers to my local machine.
- I want to remove files from the cluster right after they have been copied to my local machine, to save disk space on the cluster.

I ruled out S3 as an intermediate step because it adds an unnecessary extra copy: from the cluster to S3, then from S3 to my local machine.
I'm now considering using generators or actors to copy each file in chunks rather than as a single object (see the sketch below), but I'm not sure this is the best solution. Are there established best practices for this kind of transfer?
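To make the chunked idea concrete, here is a minimal sketch using a plain Ray actor. `FileReader`, `download`, and the 64 MiB chunk size are hypothetical names and values of mine, not an established pattern, and pinning the actor to the node that actually holds the files (e.g. via a node-affinity scheduling strategy) is elided:

```python
import os
import ray

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per fetch; a tunable assumption

@ray.remote
class FileReader:
    # Intended to run on the node that holds the files; pinning it there
    # (e.g. with a node-affinity scheduling strategy) is omitted here.
    def read_chunk(self, path: str, offset: int) -> bytes:
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(CHUNK_SIZE)

    def remove(self, path: str) -> None:
        os.remove(path)  # reclaim cluster disk right after the copy

def download(reader, remote_path: str, local_path: str) -> None:
    # Pull-based copy: at most one CHUNK_SIZE object is in flight at a time.
    offset = 0
    with open(local_path, "wb") as out:
        while True:
            chunk = ray.get(reader.read_chunk.remote(remote_path, offset))
            if not chunk:  # empty bytes -> EOF
                break
            out.write(chunk)
            offset += len(chunk)
    ray.get(reader.remove.remote(remote_path))

reader = FileReader.remote()  # in practice, one reader per worker node
download(reader, "/tmp/part-0000.parquet", "part-0000.parquet")
```

The pull-based loop keeps only one chunk in driver memory per file, and deleting the file from the actor right after the copy addresses the disk-space concern; whether this beats a streaming generator task is exactly the open question.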