I am using Ray 2.9.0.
I have two approaches to loading data from a custom file format: one subclassing Datasource and implementing get_read_tasks, and one subclassing FileBasedDatasource.
When passing in parallelism > 1, the Datasource-based approach returns multiple ReadTask instances from source.get_read_tasks(parallelism=8) and uses multiple cores to process them in parallel, as expected, like so:
ds = ray.data.read_datasource(
for x in ds.iter_rows():
The FileBasedDatasource-based approach also returns multiple ReadTask instances, but only one process executes at a time, resulting in slower iteration.
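One way to check how many processes actually take part in reading is to tag each row with the ID of the process that produced it and count the distinct values during iteration. This is only a minimal sketch: `tag_with_pid`, `rows_per_pid`, and the `"pid"` column are hypothetical names introduced here for illustration, not part of Ray, and it assumes each row is a plain dict.

```python
import os
from collections import Counter
from typing import Any, Dict, Iterable


def tag_with_pid(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row with the producing process ID attached.

    The "pid" column is a hypothetical debugging field, not a Ray feature.
    """
    return {**row, "pid": os.getpid()}


def rows_per_pid(rows: Iterable[Dict[str, Any]]) -> Counter:
    """Count how many rows each process produced."""
    return Counter(row["pid"] for row in rows)


# Local check without Ray: everything runs in this process, so one PID shows up.
rows = [tag_with_pid({"value": i}) for i in range(5)]
counts = rows_per_pid(rows)
print(len(counts), "distinct process(es):", dict(counts))
```

With Ray, the idea would be to call `tag_with_pid` inside the read function of each ReadTask and pass `ds.iter_rows()` to `rows_per_pid`. Seeing more than one PID would at least show that several workers took part. A single PID would be consistent with the reads being serialized, though it would not prove it on its own.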
Why is this happening? How can I make the FileBasedDatasource approach execute in parallel? Nothing that I’ve seen in the source code gives me any clue as to what the difference might be. Thanks!