We’ve implemented a custom datasource that reads an index from Postgres and downloads the corresponding image files from S3.
We are working with a dataset of 300+ million rows. Because the datasource reports the correct size for the entire dataset, Ray splits the read into a very large number of read tasks (1 million+).
This causes a huge delay (30+ minutes) during startup, most likely from pickling the read tasks themselves. We have already made the pickled read_fn as small as possible (_open_block is a top-level function within the module):
```python
def _create_read_fn(self, offset: int, limit: int):
    # Capture only plain values so the pickled closure stays tiny.
    table_name = self.table_name

    def read_fn():
        yield _open_block(table_name, offset, limit)

    return read_fn
```
Is there a more efficient way to handle this?
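For example, would capping the block count at the read call be the recommended approach? Something like the following sketch (override_num_blocks on recent Ray releases, parallelism on older ones; the datasource variable stands in for an instance of our custom class):

```python
import ray

# `datasource` is an instance of our custom Datasource (construction omitted).
ds = ray.data.read_datasource(
    datasource,
    override_num_blocks=4096,  # older Ray versions: parallelism=4096
)
```

We're not sure whether that cap alone is enough here, or whether our get_read_tasks also needs to emit fewer ReadTasks itself, which is why we sketched the batched read_fn above.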