Possible reasons for Ray Data getting stuck at write_csv (or write_parquet)?

xiszishu · July 14, 2023, 1:53pm

I created this task to run a Ray Data groupby over a set of CSV files (around 370 GB).

import ray

dataset = ray.data.read_csv(input_path, parse_options=parse_options)
...
grouped_data = dataset.groupby(key=sort_key)
output = grouped_data.map_groups(lambda a: a)  # identity: each group passes through unchanged
...
output.write_csv(output_dst)  # stuck here

Both the map and reduce stages work fine, but the job gets stuck at the write_csv or write_parquet step.

I tried it on a single machine and on two machines, and with both Ray 2.3 and 2.4; the problem is the same.
Any clue how to work around this?

Jules_Damji · July 17, 2023, 4:29pm

@xiszishu Can you try with Ray 2.5.1?
cc: @chengsu

xiszishu · July 18, 2023, 12:16am

Thank you! Yes, we tried 2.3.0, 2.4.0, 2.5.0, and 2.5.1, and hit the same issue.
It seems the tasks are pending scheduling, as shown in the dashboard (we have two machines with 192 cores and 2 TB of DRAM in total).

With Ray 2.5.1, it crashes after being stuck in write_csv for a while. I've collected the log files here: https://drive.google.com/file/d/1jYOakLc0ZEZSc-OrURi8iNKFKX1XKz7t/view?usp=sharing

xiszishu · July 25, 2023, 9:03am

The issue was resolved by running the script with ray job submit instead of Ray Client. Ray Client does not seem to work well with our cluster.
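
For anyone hitting the same hang, here is a minimal sketch of launching the script through ray job submit rather than connecting over Ray Client. The script name groupby_job.py, the dashboard address http://127.0.0.1:8265, and the helper build_submit_command are assumptions for illustration, not details from the original setup. The sketch also assumes the script itself calls ray.init() without a ray:// address, so that it runs as a driver on the cluster.

```python
import shlex
import subprocess


def build_submit_command(script, address="http://127.0.0.1:8265", working_dir="."):
    """Build the argv for `ray job submit` (hypothetical helper).

    address     -- Ray dashboard / Jobs API endpoint (assumed default port 8265)
    working_dir -- local directory uploaded to the cluster with the job
    script      -- entrypoint script, run on the cluster as `python <script>`
    """
    return [
        "ray", "job", "submit",
        "--address", address,
        "--working-dir", working_dir,
        "--",  # everything after this is the entrypoint command
        "python", script,
    ]


if __name__ == "__main__":
    cmd = build_submit_command("groupby_job.py")  # assumed script name
    print(shlex.join(cmd))
    # Uncomment to actually submit; requires the `ray` CLI and a running cluster.
    # subprocess.run(cmd, check=True)
```

With this approach the driver runs on the cluster itself instead of on a remote client machine, which is the main difference from the Ray Client setup that was hanging.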
