FileBasedDatasource not using multiple processes

Eli_Stevens · January 4, 2024, 7:49pm

I am using ray 2.9.0.

I have two approaches to loading data from a custom file format: one subclasses Datasource and implements get_read_tasks, and the other subclasses FileBasedDatasource and implements _read_stream.

When I pass in parallelism > 1, the Datasource-based approach returns multiple ReadTask instances from source.get_read_tasks(parallelism=8) and uses multiple cores to process them in parallel, as expected. I read it like so:

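import ray

# `source` is an instance of my custom datasource (either implementation).
ds = ray.data.read_datasource(
    source,
    parallelism=8,
)
# Iterate over every row to force the read tasks to execute.
for x in ds.iter_rows():
    pass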

The FileBasedDatasource-based approach also returns multiple ReadTask instances, but only one process executes at a time, so iterating over the dataset is slower.
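
One way to confirm that the read tasks really run one at a time is to tag every row with the process and time that produced it, then check how many tasks were active at once. The following is only a minimal sketch of that check, not part of either datasource: the helper names (tag_with_worker, task_spans, peak_overlap) and the "task" label are hypothetical, and it assumes rows are dicts and that all workers share a clock (for example, a single node).

```python
import os
import time


def tag_with_worker(row, task):
    """Return a copy of `row` annotated with a task label, the PID, and a timestamp."""
    return {**row, "task": task, "pid": os.getpid(), "ts": time.time()}


def task_spans(rows):
    """Map each task label to the (first, last) timestamp at which it produced a row."""
    spans = {}
    for row in rows:
        first, last = spans.get(row["task"], (row["ts"], row["ts"]))
        spans[row["task"]] = (min(first, row["ts"]), max(last, row["ts"]))
    return spans


def peak_overlap(spans):
    """Return the largest number of task spans that overlap at any instant."""
    events = []
    for first, last in spans.values():
        events.append((first, 0))  # 0 = start; sorts before an end at the same time
        events.append((last, 1))   # 1 = end
    active = peak = 0
    for _, kind in sorted(events):
        active += 1 if kind == 0 else -1
        peak = max(peak, active)
    return peak
```

To use it, call tag_with_worker on each row inside the read function (or inside _read_stream), passing something like the file path as the task label, and then feed ds.iter_rows() to task_spans. A peak_overlap of 1 across several tasks would match the serial behavior described above; the set of distinct "pid" values shows how many worker processes took part.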

Why is this happening? How can I make FileBasedDatasource execute in parallel? Nothing that I’ve seen in the source code gives me any clue as to what the difference might be. Thanks!
