FileBasedDatasource not using multiple processes

I am using Ray 2.9.0.

I have two approaches to loading data from a custom file format: one subclasses Datasource and implements get_read_tasks, and the other subclasses FileBasedDatasource and implements _read_stream.

When I pass parallelism > 1, the Datasource-based approach returns multiple ReadTask instances from source.get_read_tasks(parallelism=8) and processes them in parallel on multiple cores, as expected:

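import ray

# `source` is an instance of the custom Datasource subclass.
ds = ray.data.read_datasource(
    source,
    parallelism=8,
)
# Iterate over every row to force the read tasks to execute.
for x in ds.iter_rows():
    pass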

The FileBasedDatasource-based approach also returns multiple ReadTask instances, but only one process executes at a time, so iteration is slower.
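
One way to check how many worker processes actually produce rows is to tag each row with the ID of the process that read it and then count rows per process. The following is only a minimal sketch: the helper names tag_with_pid and rows_per_pid and the "pid" column are hypothetical, not part of Ray, and it assumes each row is a dict.

```python
import os
from collections import Counter


def tag_with_pid(row):
    """Return a copy of `row` with the current process ID added under "pid"."""
    return {**row, "pid": os.getpid()}


def rows_per_pid(rows):
    """Count how many rows were produced by each process ID."""
    return Counter(row["pid"] for row in rows)
```

Calling tag_with_pid on each row inside the read function of either datasource and then running rows_per_pid(ds.iter_rows()) gives one counter entry per worker process. This shows only how many distinct processes did the reading, not whether their work overlapped in time.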

Why is this happening? How can I make FileBasedDatasource execute in parallel? Nothing that I’ve seen in the source code gives me any clue as to what the difference might be. Thanks!