Metadata fetching seems to run sequentially

I have 10k small files (each around 2 MB) on S3, and Ray Data's read is very slow at the metadata fetching stage. From the logs, it appears to be fetching metadata sequentially.

Can this step be parallelized? Otherwise it seems pretty wasteful to spend 10 minutes just reading metadata.

    Metadata Fetch Progress 0: 100%|██████████| 4.18k/4.18k [02:53<00:00, 24.6 task/s]
    Metadata Fetch Progress 0:  99%|█████████▉| 4.18k/4.21k [02:59<00:01, 24.6 task/s]
    Metadata Fetch Progress 0: 100%|██████████| 4.21k/4.21k [02:59<00:00, 16.0 task/s]
    Metadata Fetch Progress 0:  97%|█████████▋| 4.21k/4.33k [03:00<00:07, 16.0 task/s]
    Metadata Fetch Progress 0: 100%|██████████| 4.33k/4.33k [03:00<00:00, 25.6 task/s]
    Metadata Fetch Progress 0:  98%|█████████▊| 4.33k/4.42k [03:01<00:03, 25.6 task/s]
    Metadata Fetch Progress 0: 100%|██████████| 4.42k/4.42k [03:01<00:00, 33.1 task/s]
    Metadata Fetch Progress 0:  99%|█████████▊| 4.42k/4.48k [03:07<00:01, 33.1 task/s]
    Metadata Fetch Progress 0: 100%|██████████| 4.48k/4.48k [03:07<00:00, 21.9 task/s]
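
For reference, the read call producing these logs is roughly the following; the bucket and prefix are hypothetical placeholders:

        import ray

        # Plain directory read: Ray Data resolves the Parquet metadata
        # (file footers) for every file before building the dataset,
        # which is the "Metadata Fetch Progress" stage above.
        ds = ray.data.read_parquet("s3://my-bucket/my-prefix/")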

Ray Data has an API called read_parquet_bulk() which takes a list of Parquet files (not a directory) and skips the up-front metadata fetch entirely. You can run the following instead:

        import s3fs
        from ray.data import Dataset, read_parquet_bulk

        fs = s3fs.S3FileSystem()
        # s3fs.glob() returns bucket-relative paths ("bucket/key"),
        # so prepend the scheme before handing them to Ray Data.
        input_ds: Dataset = read_parquet_bulk(
            [
                "s3://" + f
                for f in fs.glob(f"{relevant_directory}/*.parquet")
            ],
        )

This cut down my metadata fetch time from 11 minutes to 18 seconds. Hope this helps!
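
If you want to confirm the bulk read picked up every file, here's a minimal sanity check, reusing fs and input_ds from the snippet above; Dataset.input_files() lists the paths the read will consume:

        paths = ["s3://" + f for f in fs.glob(f"{relevant_directory}/*.parquet")]
        # The dataset should reference exactly as many input files as the glob found.
        assert len(input_ds.input_files()) == len(paths)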
