How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello!
I'm trying to use read_webdataset to read a large webdataset (30K tar files, 50K samples per tar, total size is around 10 TB). I'm using the following code:
import ray
from ray.data import read_webdataset

def launch_learning(temp_dir: str = "/tmp/ray"):
    ray.init(_temp_dir=temp_dir)

    # hdfs_path points to the HDFS directory containing the ~30K tar files
    hdfs_path = "hdfs://..."

    dataset = read_webdataset(
        hdfs_path,
        override_num_blocks=60,
    )

    # Iterate over rows one at a time and stop after the first one
    for row in dataset.iter_rows():
        print(row)
        break
The previous code fills up all my memory, and my machine ends up crashing with an OOM error (as if Ray were trying to materialize the whole dataset). The same code works fine on a similarly sized dataset made of parquet files. Is this expected?
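For reference, the parquet version that works looks roughly like this (a minimal sketch; the path is a placeholder, and read_parquet is the only real change):

import ray
from ray.data import read_parquet

def launch_learning_parquet(temp_dir: str = "/tmp/ray"):
    ray.init(_temp_dir=temp_dir)

    # Same pattern as above, but reading parquet files instead of tar files
    dataset = read_parquet("hdfs://...")  # placeholder path

    # Streaming iteration works fine here, with no OOM
    for row in dataset.iter_rows():
        print(row)
        break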
I also had to specify override_num_blocks; otherwise I got this warning:
The requested parallelism of 67030 is more than 4x the number of available CPU slots in the cluster of 15.0.
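For what it's worth, I picked 60 by keeping the requested parallelism within 4x the available CPU slots, as the warning suggests (a sketch; ray.cluster_resources() is only used here to read the CPU count):

import ray

ray.init()
# The warning says requested parallelism should stay within 4x the
# CPU slots; with 15 CPU slots that gives 4 * 15 = 60 blocks.
num_cpus = int(ray.cluster_resources().get("CPU", 1))
num_blocks = 4 * num_cpus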
Can you help me with this?
Thanks a lot!