Ray Data runs into OOM issues during write_csv or write_parquet

I created this job to run ray.data.groupby over our CSV files (around 370 GB in total):

dataset = ray.data.read_csv(input_path, parse_options=parse_options)
...
grouped_data = dataset.groupby(key=sort_key)
output = grouped_data.map_groups(lambda a: a)
...
output.write_csv(output_dst) # OOM issue here

Our setup is two machines with 192 cores and 2 TB of DRAM in total.
The memory usage pattern is strange: most of the time the job consumes ~500 GB on one node, but it sometimes bursts to more than ~900 GB during the write step.
In addition, the write step has been quite slow and accounts for most of the runtime; the current total execution time is around 5 hours.
I have tried reducing num_cpus to limit memory usage, but then the job takes too long (more than 15 hours) to finish.
Our previous attempts with <100 GB of data worked fine with Ray Data, but I am not sure whether data above the 100 GB level has been verified with Ray Data.
Is there a recommended configuration that avoids the OOM while keeping the execution time reasonable?
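
Roughly, the configuration I have been experimenting with looks like the sketch below. The paths, group-by column, parse options, and the 200 GB object-store limit are placeholders rather than our real values, and it assumes the DataContext / execution_options resource-limit API available in recent Ray releases and that map_groups forwards num_cpus to its underlying tasks:

import ray
from pyarrow import csv
from ray.data.context import DataContext

ray.init()

# Assumed API (recent Ray releases): cap how much object store memory the
# streaming executor may use at once. 200 GiB here is a placeholder value.
ctx = DataContext.get_current()
ctx.execution_options.resource_limits.object_store_memory = 200 * 1024**3

# Placeholder paths / options, not the real ones from our job.
input_path = "s3://example-bucket/input/"
output_dst = "s3://example-bucket/output/"
sort_key = "id"
parse_options = csv.ParseOptions(delimiter=",")

dataset = ray.data.read_csv(input_path, parse_options=parse_options)

grouped_data = dataset.groupby(key=sort_key)
# num_cpus reserves more CPUs per task, which lowers how many group tasks
# run concurrently on each node and therefore the peak heap usage.
output = grouped_data.map_groups(lambda a: a, num_cpus=4)

output.write_csv(output_dst)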

@xiszishu Thanks for posting this. Did you consider partitioning the data?
Here are some performance tips on partitioning data for effective use of Ray Data.
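
For example, here is a minimal sketch of asking the read for more, smaller blocks and then inspecting how the data was actually partitioned (the path and block count are placeholders; it assumes the parallelism argument to read_csv and the Dataset.num_blocks() / Dataset.stats() APIs):

import ray

ray.init()

# Placeholder path; the real job reads ~1600 CSV files (~370 GB total).
dataset = ray.data.read_csv(
    "s3://example-bucket/input/",
    parallelism=2000,  # request more, smaller blocks so no single block is huge
)

print(dataset.num_blocks())  # how many blocks the data was split into
print(dataset.stats())       # per-stage timing report (most detailed after execution)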

cc: @chengsu

Thank you! We will take a look.
Our original dataset is partitioned into 1600 files.
We have tried repartitioning the output into fewer files (e.g., 300 and 800 files), but it does not help (see the sketch at the end of this comment).
Besides the write step, we also find that the map_batches step is abnormally slow.
We also tried the 2.6.1 release, but we still hit the same issue.
Please let us know if you have any further suggestions on the configuration.
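
For reference, the repartition attempt looked roughly like the sketch below (the path and group-by column are placeholders, not our real ones; it assumes Dataset.repartition and write_parquet as documented):

import ray

# Placeholder path and column; this mirrors the pipeline from the original post.
dataset = ray.data.read_csv("s3://example-bucket/input/")
output = dataset.groupby(key="id").map_groups(lambda a: a)

# What we tried: collapse the output into fewer blocks before writing,
# so the write step emits ~300 files instead of ~1600.
output = output.repartition(300)                     # also tried 800
output.write_parquet("s3://example-bucket/output/")  # same slowness/OOM seen with write_csv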