Best practices for retrieving data from a cluster

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

I run my job on an AWS cluster. As its output, the job creates multiple Parquet files on each worker (~1 GB each, 100-500 GB total).
I'm wondering what the best practices are for retrieving these files to my local machine (outside of the cluster).
My current solution: return the content of each file from a remote function, then fetch that content from the Ray object store to my local machine. This does not scale and leads to OOM errors.
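
For reference, a minimal sketch of what I do today (the paths and the init call are illustrative placeholders, and node placement is omitted):

```python
import os
import ray

ray.init(address="auto")  # placeholder: however the driver attaches to the cluster

@ray.remote
def read_whole_file(path: str) -> bytes:
    # Returns the entire ~1 GB file through the Ray object store.
    with open(path, "rb") as f:
        return f.read()

# Placeholder paths; in reality these files live on different worker nodes.
parquet_paths = ["/data/output/part-0.parquet", "/data/output/part-1.parquet"]

refs = [read_whole_file.remote(p) for p in parquet_paths]

# All files land in the object store at once, and each ray.get then pulls a
# whole ~1 GB payload to the driver; with hundreds of GB total this is where
# memory blows up.
for path, ref in zip(parquet_paths, refs):
    with open(os.path.basename(path), "wb") as out:
        out.write(ray.get(ref))
```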

  1. I ruled out rsync because:
    It's unclear how to sync from all of the workers to my local machine.
    I want to remove files from the cluster right after they have been copied to my local machine (to save disk space on the cluster).

  2. I ruled out S3 as an intermediate step because it adds an unnecessary extra copy: from the cluster to S3, and then from S3 to my local machine.

I'm considering using generators or actors to copy the files in chunks (rather than each whole file at once).
However, I'm not sure whether this is the best solution.
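
For example, the kind of generator-based task I have in mind looks roughly like this (a sketch, assuming Ray 2.7+ where generator tasks stream one object ref per yielded chunk; the path and chunk size are placeholders):

```python
import ray

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk (placeholder value)

@ray.remote
def stream_file(path: str):
    # Yield the file chunk by chunk so only small pieces ever sit in the
    # object store, instead of returning the whole ~1 GB file at once.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield chunk

# Iterating the returned generator gives one ObjectRef per yielded chunk.
chunk_refs = stream_file.remote("/data/output/part-0.parquet")
with open("part-0.parquet", "wb") as out:
    for ref in chunk_refs:
        out.write(ray.get(ref))
```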

Perhaps there are established best practices for cases like this.

While I think using S3 is the best solution, you can do the following:

  • Use Ray Client.
  • Run a remote task on every node to read the files chunk by chunk (e.g., each task reads 100 MB of data).
  • Stream the data to the client. The client should limit the number of concurrent tasks to avoid OOM.

Note that this approach still copies data from the workers to the head node, so it is important to apply admission control to the maximum number of tasks sending data to the driver; see the sketch below.
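
A rough sketch of that pattern (untested; the Ray Client address, chunk size, concurrency limit, and the file metadata passed to `fetch_file` are placeholders you would fill in):

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init("ray://<head-node-ip>:10001")  # Ray Client; replace with your address

CHUNK_SIZE = 100 * 1024 * 1024  # ~100 MB per task
MAX_IN_FLIGHT = 4               # admission control: cap chunks buffered on the driver

@ray.remote
def read_chunk(path: str, offset: int, size: int) -> bytes:
    # Read one byte range of a file that lives on a specific worker node.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

def fetch_file(node_id: str, remote_path: str, file_size: int, local_path: str) -> None:
    # Pin the chunk-reading tasks to the node that actually holds the file.
    reader = read_chunk.options(
        scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)
    )
    pending = []
    with open(local_path, "wb") as out:
        for offset in range(0, file_size, CHUNK_SIZE):
            pending.append(reader.remote(remote_path, offset, CHUNK_SIZE))
            if len(pending) >= MAX_IN_FLIGHT:
                # Consume chunks in submission order so writes stay sequential
                # and at most MAX_IN_FLIGHT chunks are outstanding at once.
                out.write(ray.get(pending.pop(0)))
        for ref in pending:
            out.write(ray.get(ref))
```

You can discover node IDs with `ray.nodes()`, list the Parquet files and their sizes with a small task pinned to each node in the same way, and delete each remote file with another pinned task once its local copy is verified, which also covers the cleanup requirement.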

Also note that Ray Data would be a good solution here, but it does not work with Ray Client.

I think if you use S3, it can be as simple as:

  • Use Ray Data to write the output to S3.
  • On your local machine, do a streaming read (see the sketch below).
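
For example (a rough sketch; the bucket/prefix is a placeholder, the Dataset here stands in for whatever your job already produces, and the local side uses PyArrow for the streaming read, though Ray Data works locally as well):

```python
import ray

# On the cluster: have the job write its output straight to S3 with Ray Data.
ds = ray.data.from_items([{"id": i, "value": i * i} for i in range(1000)])
ds.write_parquet("s3://my-bucket/job-output/")  # placeholder bucket/prefix
```

```python
# On the local machine: stream the Parquet files back batch by batch
# instead of downloading and loading everything at once.
import pyarrow.dataset as pads

dataset = pads.dataset("s3://my-bucket/job-output/", format="parquet")
for batch in dataset.to_batches():
    print(batch.num_rows)  # replace with your own processing
```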

Thank you for the detailed answer.