Repartition killed because of OOM?

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

I have a 100 GB Parquet dataset in Google Cloud Storage. It currently has 500 partitions, and I'm trying to repartition it into 2000. I attempted to run the code below on a cluster of 5 nodes, each with 600 GB of RAM.

import gcsfs
import ray.data as rd

gs_project = "my-project"
pq_paths = ["gs://path-1.parquet", ..., "gs://path-500.parquet"]
fs = gcsfs.GCSFileSystem(project=gs_project)
output_dir = "gs://output.parquet"
num_partitions = 2000
(
    rd.read_parquet(pq_paths, filesystem=fs, ray_remote_args={"max_retries": 10})
    .repartition(num_blocks=num_partitions, shuffle=False)
    .write_parquet(output_dir, filesystem=fs)
)

When this runs, memory spikes sharply to ~100% on 2 of my nodes, and then I just see “killed” in bash. I’m using Ray 1.11.0. If it helps, this snippet worked with Ray 1.8.0: memory would spike, but the plasma store would fill up and data would spill to disk.

Is there anything I can do to fix it?

Hi @hahdawg, thanks for posting!

The fact that only 2 of your 5 nodes spike suggests that data/tasks are not getting properly load-balanced across the cluster, which should happen automatically. A few clarifying questions and suggestions:

  1. What’s the memory utilization of the other 3 nodes when that spike happens?
  2. How large is the in-memory representation of the data? (There’s a quick sketch after this list for one way to check.)
  3. You’re saying that this exact same snippet worked on Ray 1.8, correct?
  4. Could you try the recent Ray 1.12 release and see if it still OOMs?
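
For (1) and (2), here’s a minimal sketch of one way to gather those numbers, assuming the read itself succeeds on your cluster; it reuses the placeholder project and paths from your snippet and the Dataset num_blocks()/size_bytes() APIs:

import gcsfs
import ray
import ray.data as rd

ray.init(address="auto")  # attach to the already-running cluster

# Same placeholder project/paths as in your snippet above.
fs = gcsfs.GCSFileSystem(project="my-project")
pq_paths = ["gs://path-1.parquet", ..., "gs://path-500.parquet"]

ds = rd.read_parquet(pq_paths, filesystem=fs)

# (2) Size of the dataset's blocks in memory, to compare against the ~100 GB on disk.
print("num blocks:", ds.num_blocks())
print("in-memory bytes:", ds.size_bytes())

For (1), running the ray status CLI command on the head node gives a cluster-level resource summary, and the Ray dashboard breaks memory and object store usage down per node while the job is running.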