How severely does this issue affect your experience of using Ray?
- Medium: It causes significant difficulty in completing my task, but I can work around it.
I run my job on an AWS cluster. The job creates multiple Parquet files on each worker (~1 GB each, 100-500 GB in total).
What are the best practices for retrieving these files to my local machine (outside the cluster)?
Current solution: each remote function returns the contents of its file, and I then fetch those contents from the object store to my local machine. This does not scale and leads to OOM errors.
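Roughly, my current approach looks like this (simplified; the paths are placeholders):

```python
import os
import ray

@ray.remote
def read_file(path: str) -> bytes:
    # Returns the whole ~1 GB file through the object store.
    with open(path, "rb") as f:
        return f.read()

paths = ["/tmp/part-0000.parquet"]  # placeholder: the files on the workers
refs = [read_file.remote(p) for p in paths]
for path, ref in zip(paths, refs):
    # ray.get() materializes each full file in driver memory -> OOM at scale.
    with open(os.path.basename(path), "wb") as out:
        out.write(ray.get(ref))
```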
I ruled out rsync because:
- It's unclear how to sync from all workers to my local machine.
- I want to remove files from the cluster right after they have been copied to my local machine, to save disk space on the cluster.

I ruled out S3 as an intermediate step because it adds an unnecessary extra copy: from the cluster to S3, then from S3 to my local machine.
I'm now considering using generators or actors to copy each file in chunks rather than as a single object (see the sketch below), but I'm not sure this is the best solution. Are there established best practices for this kind of transfer?
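To make the chunked idea concrete, here is a minimal sketch using a plain Ray actor. `FileReader`, `download`, and the 64 MiB chunk size are hypothetical names and values of mine, not an established pattern, and pinning the actor to the node that actually holds the files (e.g. via a node-affinity scheduling strategy) is elided:

```python
import os
import ray

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per fetch; a tunable assumption

@ray.remote
class FileReader:
    # Intended to run on the node that holds the files; pinning it there
    # (e.g. with a node-affinity scheduling strategy) is omitted here.
    def read_chunk(self, path: str, offset: int) -> bytes:
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(CHUNK_SIZE)

    def remove(self, path: str) -> None:
        os.remove(path)  # reclaim cluster disk right after the copy

def download(reader, remote_path: str, local_path: str) -> None:
    # Pull-based copy: at most one CHUNK_SIZE object is in flight at a time.
    offset = 0
    with open(local_path, "wb") as out:
        while True:
            chunk = ray.get(reader.read_chunk.remote(remote_path, offset))
            if not chunk:  # empty bytes -> EOF
                break
            out.write(chunk)
            offset += len(chunk)
    ray.get(reader.remove.remote(remote_path))

reader = FileReader.remote()  # in practice, one reader per worker node
download(reader, "/tmp/part-0000.parquet", "part-0000.parquet")
```

The pull-based loop keeps only one chunk in driver memory per file, and deleting the file from the actor right after the copy addresses the disk-space concern; whether this beats a streaming generator task is exactly the open question.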