How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello!
I'm trying to use read_webdataset to read a large webdataset (30K tar files, 50K samples per tar, total size is around 10 TB). I'm using the following code:
import ray
from ray.data import read_webdataset

def launch_learning(temp_dir: str = "/tmp/ray"):
    ray.init(_temp_dir=temp_dir)

    # hdfs_path points to the HDFS directory containing the ~30K tar files
    hdfs_path = "hdfs://..."

    dataset = read_webdataset(
        hdfs_path,
        override_num_blocks=60,
    )

    # Iterate over rows one at a time and stop after the first one
    for row in dataset.iter_rows():
        print(row)
        break
The previous code fills up all my memory, and my machine ends up crashing with an OOM error (as if Ray were trying to materialize the whole dataset). The same code works fine on a similarly sized dataset made of parquet files. Is this expected?
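For reference, the parquet version that works looks roughly like this (a minimal sketch; the path is a placeholder, and read_parquet is the only real change):

import ray
from ray.data import read_parquet

def launch_learning_parquet(temp_dir: str = "/tmp/ray"):
    ray.init(_temp_dir=temp_dir)

    # Same pattern as above, but reading parquet files instead of tar files
    dataset = read_parquet("hdfs://...")  # placeholder path

    # Streaming iteration works fine here, with no OOM
    for row in dataset.iter_rows():
        print(row)
        break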
I also had to specify override_num_blocks; otherwise I got this warning:
The requested parallelism of 67030 is more than 4x the number of available CPU slots in the cluster of 15.0.
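For what it's worth, I picked 60 by keeping the requested parallelism within 4x the available CPU slots, as the warning suggests (a sketch; ray.cluster_resources() is only used here to read the CPU count):

import ray

ray.init()
# The warning says requested parallelism should stay within 4x the
# CPU slots; with 15 CPU slots that gives 4 * 15 = 60 blocks.
num_cpus = int(ray.cluster_resources().get("CPU", 1))
num_blocks = 4 * num_cpus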
Can you help me with this?
Thanks a lot!