I created this task to perform a ray.data.groupby over CSV files (around 370 GB):
import ray

# Read the input CSV files (~370 GB) into a Ray Dataset
dataset = ray.data.read_csv(input_path, parse_options=parse_options)
...
grouped_data = dataset.groupby(key=sort_key)
output = grouped_data.map_groups(lambda a: a)  # identity UDF; just materializes each group
...
output.write_csv(output_dst) # OOM issue here
Our setup is two machines with 192 cores and 2 TB of DRAM in total.
The memory usage is strange: most of the time the job consumes ~500 GB on one node, but it occasionally bursts to more than 900 GB during the write step.
In addition, the write step has been quite slow and accounts for most of the runtime; the total execution time is currently around 5 hours.
I've been trying to reduce num_cpus to limit memory usage, but then the job takes too long (more than 15 hours) to finish.
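For reference, this is roughly how I've been attaching the hint, just a minimal sketch: the values are placeholders, and passing num_cpus directly to map_groups and ray_remote_args to write_csv is simply one way of attaching per-task resource requests, not necessarily the exact code we run.

# Sketch only; placeholder values, not our real settings
output = grouped_data.map_groups(
    lambda a: a,
    num_cpus=2,  # larger per-task CPU request -> fewer concurrent tasks -> lower peak memory
)
output.write_csv(
    output_dst,
    ray_remote_args={"num_cpus": 2},  # same idea applied to the write tasks
)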
Our previous attempts with <100 GB of data worked fine with Ray Data; I'm not sure whether data at the >100 GB level has been verified with Ray Data.
Is there any recommended configuration that would avoid the issue while still finishing in a reasonable amount of time?