I am using Ray Data to read 100 Parquet files (each ~15 MB) from S3 as follows:
import time
import ray

start = time.perf_counter()
ds = ray.data.read_parquet("s3://<dir path>")
materialized_ds = ds.materialize()
end = time.perf_counter()
print(f"read + materialize took {end - start:.1f}s")
# Plan to do some other data processing afterwards.
# materialized_ds.map_batches(...)
It takes 10+ minutes to finish this process. When I checked the trace file from the Ray Dashboard, I found that ray.data.read_api._get_reader runs sequentially on only a single worker. Can this step be parallelized across multiple workers?
I also tried repartitioning the input data into fewer but larger files, manually tuning parallelism, and using the read_parquet_bulk API instead, but ray.data.read_api._get_reader still shows up as being executed on only a single worker.
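For reference, this is roughly what the two variants looked like (the bucket path, file names, and the parallelism value below are placeholders, not my actual setup):

import ray

# Variant 1: hint Ray Data to split the read into more tasks.
# The parallelism value is only an example.
ds = ray.data.read_parquet("s3://<dir path>", parallelism=100)

# Variant 2: pass explicit file paths to read_parquet_bulk so the
# per-file metadata inference is skipped; file names are placeholders.
paths = [f"s3://<dir path>/part-{i:05d}.parquet" for i in range(100)]
ds_bulk = ray.data.read_parquet_bulk(paths)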