Read_binary_files does not load data from S3 in parallel

DevKretov · March 21, 2024, 3:38pm

I tried to test Ray Data for parallel processing of binary data that I store in my S3 bucket. There are lots of files I want to process (PDFs, images, doc files, .ical files, etc.), and their sizes range from several KB up to 100-200 MB.

I wanted to use Ray Data, specifically the ray.data.read_binary_files method, to load the data and then write a custom Actor to process each file in parallel. The problem, however, is that this function call seems to start downloading all the files I provide (I pass a list of s3:// paths, one per file) in a single process, without any parallelism. Neither setting concurrency nor any other modification helps.

How, then, do I load lots of files (petabytes) in parallel from S3?

P.S. I run Ray on my EC2 instance in a conda environment in a Jupyter notebook.
P.P.S. My Ray version is 2.9.3.

import ray

ray.init(num_cpus=16)

paths = ['s3://...', 's3://...', ...]

ds = ray.data.read_binary_files(
    paths,
    include_paths=True
)

raulchen · April 9, 2024, 7:54pm

It could be because the auto-detected parallelism is too small. You can set override_num_blocks=N to manually set a larger parallelism.
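
For illustration, here is a minimal sketch of passing override_num_blocks to the read call from the original post. The helper choose_num_blocks and its blocks_per_cpu default are hypothetical, just one way to pick N; they are not part of Ray. Note also that on Ray 2.9.x the equivalent argument may still be named parallelism (I believe override_num_blocks arrived in a later release), so check the signature for your installed version.

```python
def choose_num_blocks(num_files: int, num_cpus: int, blocks_per_cpu: int = 4) -> int:
    """Hypothetical heuristic: a few read tasks per CPU, at most one per file."""
    if num_files <= 0:
        return 1
    return max(1, min(num_files, num_cpus * blocks_per_cpu))


def read_files(paths, num_cpus):
    """Read binary files with an explicit number of output blocks."""
    # Imported lazily so the heuristic above can be used without Ray installed.
    import ray

    num_blocks = choose_num_blocks(len(paths), num_cpus)
    # override_num_blocks is the argument suggested above; on older Ray
    # releases the same knob may be exposed as `parallelism` instead.
    return ray.data.read_binary_files(
        paths,
        include_paths=True,
        override_num_blocks=num_blocks,
    )
```

With 16 CPUs and a few thousand paths, this would request 64 blocks; whether that is a good value depends on file sizes and available memory, so treat it as a starting point to tune.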
