I created this task to perform a ray.data.groupby over CSV files (around 370 GB):
import ray

# Read the input CSV files (~370 GB) into a Ray Dataset
dataset = ray.data.read_csv(input_path, parse_options=parse_options)
...
grouped_data = dataset.groupby(key=sort_key)
output = grouped_data.map_groups(lambda a: a)  # identity UDF; just materializes each group
...
output.write_csv(output_dst) # OOM issue here
Our setup is two machines with 192 cores and 2 TB of DRAM in total.
The memory usage is strange: most of the time the job consumes ~500 GB on one node, but it occasionally bursts to more than 900 GB during the write step.
In addition, the write step has been quite slow and accounts for most of the runtime; the total execution time is currently around 5 hours.
I've been trying to reduce num_cpus to limit memory usage, but then the job takes too long (more than 15 hours) to finish.
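For reference, this is roughly how I've been attaching the hint, just a minimal sketch: the values are placeholders, and passing num_cpus directly to map_groups and ray_remote_args to write_csv is simply one way of attaching per-task resource requests, not necessarily the exact code we run.

# Sketch only; placeholder values, not our real settings
output = grouped_data.map_groups(
    lambda a: a,
    num_cpus=2,  # larger per-task CPU request -> fewer concurrent tasks -> lower peak memory
)
output.write_csv(
    output_dst,
    ray_remote_args={"num_cpus": 2},  # same idea applied to the write tasks
)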
Our previous attempts with <100 GB of data worked fine with Ray Data; I'm not sure whether data at the >100 GB level has been verified with Ray Data.
Is there any recommended configuration that would avoid the issue while still finishing in a reasonable amount of time?