Ray cluster is not spilling memory

jaysin60 · April 10, 2024, 7:07pm

How severely does this issue affect your experience of using Ray?

High: It blocks me from completing my task.

Hi,

Problem: I’m trying to process Parquet data with Modin on a Ray cluster spanning multiple AWS EC2 instances. After some time, the Ray head node hangs, and I end up restarting the machine.

Command to start the Ray cluster:
ray start --head --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/tmp/spill\"}}"}'

More details: I’m reading a 30 GB Parquet file from an S3 location. Once the data is loaded into a DataFrame, memory_usage() reports ~1000 GB.
I’m using 4 p3.16xlarge, 1 p3dn.24xlarge, and 4 r5.16xlarge instances, which gives me about 2 TB of object_store_memory.

Am I doing something wrong? I don’t see spilled objects being written to external storage, as they are on a single-node machine.

ramoc · December 27, 2024, 4:10pm

Hello

It seems the issue might be with memory management or object spilling. Verify that /tmp/spill has sufficient disk space and correct the object_spilling_config command:

ray start --head --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/tmp/spill\"}}"}'
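The nested quoting in that command is easy to get wrong, because the value of object_spilling_config must itself be a JSON-encoded string inside the outer JSON. A minimal sketch that builds the argument with the standard library instead of escaping by hand; the helper names build_system_config and build_head_command are made up for illustration and are not part of Ray:

```python
import json
import shlex


def build_system_config(spill_dir: str) -> str:
    """Return the JSON text expected by --system-config (hypothetical helper)."""
    spilling = {"type": "filesystem", "params": {"directory_path": spill_dir}}
    # The inner config is serialized first, so it ends up as a *string* value
    # (with escaped quotes) inside the outer JSON object.
    inner = json.dumps(spilling, separators=(",", ":"))
    return json.dumps({"object_spilling_config": inner}, separators=(",", ":"))


def build_head_command(spill_dir: str) -> str:
    """Return a shell-safe `ray start --head` command line (hypothetical helper)."""
    # shlex.quote wraps the JSON in single quotes so the shell passes it through intact.
    return "ray start --head --system-config=" + shlex.quote(build_system_config(spill_dir))


if __name__ == "__main__":
    print(build_head_command("/tmp/spill"))
```

Printing the command and pasting it into the shell avoids the smart-quote and lost-backslash problems that tend to creep in when the string is copied from a web page.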

Check the Ray dashboard for memory usage and the logs (/tmp/ray/session_latest/logs) for errors. Ensure Modin is attached to the cluster and distributing tasks across all nodes, for example by calling ray.init(address="auto") before importing modin.pandas. If the problem persists, try testing with smaller datasets or with fewer, larger nodes.
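The disk-space check on the spill directory can be scripted and run on every node. A rough sketch using only the standard library; the function name free_gib and the 100 GiB threshold are assumptions chosen for illustration, not values recommended by Ray:

```python
import os
import shutil


def free_gib(path: str) -> float:
    """Return free space, in GiB, on the filesystem that holds `path`."""
    probe = os.path.abspath(path)
    # The spill directory may not exist yet, so walk up to the nearest
    # existing parent and measure that filesystem instead.
    while not os.path.exists(probe):
        probe = os.path.dirname(probe)
    return shutil.disk_usage(probe).free / 2**30


if __name__ == "__main__":
    spill_dir = "/tmp/spill"   # directory from the object_spilling_config above
    threshold_gib = 100.0      # assumed threshold; size it to your workload
    available = free_gib(spill_dir)
    status = "OK" if available >= threshold_gib else "LOW"
    print(f"{spill_dir}: {available:.1f} GiB free [{status}]")
```

With roughly 1000 GB of in-memory data, a /tmp on a small root volume can fill up well before spilling makes a difference, so it is worth confirming which volume the directory actually lives on.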
