Ray Data has an API called read_parquet_bulk(), which takes a list of Parquet files (not a directory) and skips the up-front per-file metadata fetching that read_parquet() does. You can run the following instead:
import s3fs
from ray.data import Dataset, read_parquet_bulk

fs = s3fs.S3FileSystem()

# Placeholder: the S3 prefix holding your Parquet files,
# e.g. "my-bucket/path/to/data".
relevant_directory = "my-bucket/path/to/data"

input_ds: Dataset = read_parquet_bulk(
    [
        # s3fs.glob() returns paths without the scheme, so add it back.
        "s3://" + f
        for f in fs.glob(f"{relevant_directory}/*.parquet")
    ],
)
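For comparison, the slower directory-based read this replaces would look roughly like the sketch below (assuming relevant_directory is the same S3 prefix as above):

from ray.data import read_parquet

# Pointing read_parquet() at the directory makes Ray resolve Parquet
# metadata for every file up front, which is the slow part avoided above.
input_ds = read_parquet(f"s3://{relevant_directory}/")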
This cut down my metadata fetch time from 11 minutes to 18 seconds. Hope this helps!