I am using Ray 2.9.0.
I have two approaches to loading data from a custom file format: one subclassing Datasource and implementing get_read_tasks, and one subclassing FileBasedDatasource and implementing _read_stream.
When passing in parallelism > 1, the Datasource-based approach returns multiple ReadTask instances from source.get_read_tasks(parallelism=8) and uses multiple cores to process them in parallel as expected, like so:
ds = ray.data.read_datasource(
    source,
    parallelism=8,
)
for x in ds.iter_rows():
    pass
The FileBasedDatasource-based approach also returns multiple ReadTask instances, but only one process executes at a time, resulting in slower iteration.
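For reference, one way to confirm the "one process at a time" observation is to record the process ID and start/end time around each read and check whether the recorded spans overlap. The sketch below is independent of Ray and uses only the standard library. The helper names timed and max_concurrency are hypothetical, not part of any library, and the sketch assumes all tasks run on one machine so that wall-clock times are comparable.

```python
import os
import time


def timed(read_fn, report=print):
    """Wrap read_fn so each call reports a (pid, start, end) span.

    Note: if read_fn is a generator function, the span only covers
    creating the generator, not consuming it.
    """
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return read_fn(*args, **kwargs)
        finally:
            report((os.getpid(), start, time.time()))
    return wrapper


def max_concurrency(spans):
    """Return the largest number of (pid, start, end) spans that overlap."""
    events = []
    for _pid, start, end in spans:
        events.append((start, 1))   # a task starts
        events.append((end, -1))    # a task ends
    # At equal timestamps, process ends before starts so that
    # back-to-back spans are not counted as overlapping.
    events.sort()
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak
```

A result of 1 from max_concurrency over the collected spans would match the serial behavior described above, while a value close to the requested parallelism would indicate that the reads do overlap.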
Why is this happening? How can I make FileBasedDatasource execute in parallel? Nothing that I’ve seen in the source code gives me any clue as to what the difference might be. Thanks!