The slowdown is likely because metadata fetching runs sequentially over the files.

ray.data has an API called read_parquet_bulk(), which takes a list of Parquet files (rather than a directory) and skips the per-file metadata fetch. You can run the following instead:

        import s3fs
        from ray.data import Dataset, read_parquet_bulk

        fs = s3fs.S3FileSystem()
        input_ds: Dataset = read_parquet_bulk(
            [
                # fs.glob() returns bucket/key paths without the
                # scheme, so prepend "s3://" to each one
                "s3://" + f
                for f in fs.glob(f"{relevant_directory}/*.parquet")
            ],
        )

This cut down my metadata fetch time from 11 minutes to 18 seconds. Hope this helps!
