I created this task to run a ray.data groupby over a set of CSV files (around 370 GB).
dataset = ray.data.read_csv(input_path, parse_options=parse_options)
...
grouped_data = dataset.groupby(key=sort_key)
output = grouped_data.map_groups(lambda a: a)
...
output.write_csv(output_dst) # stuck here
Both the map and reduce stages complete fine, but the job gets stuck at the write_csv (or write_parquet) step.
I tried on a single machine and on two machines, and with both Ray 2.3 and 2.4; the problem is the same.
Any clue how to work around this?
@xiszishu Can you try with Ray 2.5.1?
cc: @chengsu
Thank you! Yes, we tried 2.3.0, 2.4.0, 2.5.0 and 2.5.1; same issue.
According to the dashboard, the tasks seem to be pending scheduling (we have two machines with 192 cores and 2 TB of DRAM in total).
With Ray 2.5.1, it crashes after being stuck in write_csv for a while. I’ve collected the log files here: https://drive.google.com/file/d/1jYOakLc0ZEZSc-OrURi8iNKFKX1XKz7t/view?usp=sharing
So the issue was resolved by using ray job submit; Ray Client does not seem to work well with the cluster.
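For anyone landing here with the same hang, this is a rough sketch of what switching from Ray Client to job submission can look like. It only builds (and optionally runs) the `ray job submit` command line; the script name `groupby_job.py`, the dashboard address `http://127.0.0.1:8265`, and the helper names are placeholders I made up, not taken from the original setup. The assumption is that the script itself calls plain `ray.init()` rather than connecting through a `ray://` address.

```python
import shlex
import subprocess


def build_job_submit_command(script, address="http://127.0.0.1:8265", working_dir="."):
    """Return the argv list for `ray job submit` that runs `script` on the cluster.

    `address` is the dashboard/job server URL of the head node (placeholder default),
    and `working_dir` is the local directory uploaded along with the job.
    """
    return [
        "ray", "job", "submit",
        "--address", address,
        "--working-dir", working_dir,
        "--",  # everything after this is the entrypoint executed on the cluster
        "python", script,
    ]


def submit(script, **kwargs):
    """Print the command, then run it; raises CalledProcessError on a non-zero exit."""
    cmd = build_job_submit_command(script, **kwargs)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)
```

Calling `submit("groupby_job.py")` from a machine that can reach the head node should then run the whole pipeline, including the write step, as a driver on the cluster instead of proxying each call through Ray Client.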