I am using Ray Data to read 100 Parquet files (each ~15 MB) from S3 as follows:
import time
import ray

start = time.perf_counter()
ds = ray.data.read_parquet("s3://<dir path>")
materialized_ds = ds.materialize()
end = time.perf_counter()
print(f"read + materialize took {end - start:.1f}s")
# Plan to do some other data processing afterwards.
# materialized_ds.map_batches(...)
It takes 10+ minutes to finish this process. When I checked the trace file from the Ray Dashboard, I found that ray.data.read_api._get_reader runs sequentially on only a single worker. Can this step be parallelized across multiple workers?
I also tried repartitioning the input data into fewer but larger files, manually tuning parallelism, and using the read_parquet_bulk API instead, but ray.data.read_api._get_reader still shows up as being executed on only a single worker.
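For reference, this is roughly what the two variants looked like (the bucket path, file names, and the parallelism value below are placeholders, not my actual setup):

import ray

# Variant 1: hint Ray Data to split the read into more tasks.
# The parallelism value is only an example.
ds = ray.data.read_parquet("s3://<dir path>", parallelism=100)

# Variant 2: pass explicit file paths to read_parquet_bulk so the
# per-file metadata inference is skipped; file names are placeholders.
paths = [f"s3://<dir path>/part-{i:05d}.parquet" for i in range(100)]
ds_bulk = ray.data.read_parquet_bulk(paths)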