Ray cluster is not spilling memory

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi,

Problem: I’m trying to process a Parquet dataset with Modin on a Ray cluster that spans multiple AWS EC2 instances. After some time, the Ray head node hangs, and I end up having to restart the machine.

Command to start the Ray cluster:
ray start --head --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/tmp/spill\"}}"}'

More details: I’m reading a 30 GB Parquet file from S3. Once the data is loaded into a DataFrame, memory_usage() reports ~1000 GB.
I’m using 4 p3.16xlarge, 1 p3dn.24xlarge and 4 r5.16xlarge instances, which gives me about 2 TB of object_store_memory.

Am I doing something wrong? I don’t see the external storage being used for spilled objects, as I do on a single-node machine.

Hello

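It seems the issue might be with memory management or object spilling. Verify that /tmp/spill has sufficient free disk space on each node, and check the quoting of the object_spilling_config command. As pasted, it contains typographic (curly) quotes, and the inner JSON has lost its backslash escapes. It should use straight quotes, with the inner JSON escaped:

ray start --head --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/tmp/spill\"}}"}'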
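
The nested quoting is easy to get wrong because the value of object_spilling_config is itself a JSON-encoded string inside the outer JSON. As a rough sketch, the argument can be generated with Python's json module instead of typing the escapes by hand. The helper name build_system_config is hypothetical, and the directory is simply the one from the question:

```python
import json
import shlex


def build_system_config(spill_dir: str) -> str:
    """Return the JSON text for --system-config.

    The spilling config is encoded twice: once as its own JSON string,
    then again as a string value inside the outer JSON object.
    """
    inner = json.dumps(
        {"type": "filesystem", "params": {"directory_path": spill_dir}},
        separators=(",", ":"),
    )
    return json.dumps({"object_spilling_config": inner}, separators=(",", ":"))


config = build_system_config("/tmp/spill")

# shlex.quote wraps the JSON in single quotes so a POSIX shell passes it
# through unchanged.
print("ray start --head --system-config=" + shlex.quote(config))
```

Copying the printed line into the terminal avoids the curly quotes that editors and forum software tend to introduce. When starting Ray from Python instead of the CLI, the same inner string should be usable as the object_spilling_config value in the _system_config dictionary passed to ray.init, though that is worth checking against the documentation for your Ray version.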

Check the Ray dashboard for memory usage and the logs under /tmp/ray/session_latest/logs for errors. Make sure Modin is attached to the existing cluster so that work is distributed across all nodes: call ray.init(address="auto") before importing modin.pandas. If the problem persists, try a smaller dataset or fewer, larger nodes.
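
To make the disk-space check concrete, here is a small standard-library sketch intended to be run on each node. The function name spill_dir_report and the 100 GiB threshold are illustrative assumptions, not Ray settings. It only reports how much space is free and how many bytes currently sit under the spill directory:

```python
import os
import shutil


def spill_dir_report(path: str, min_free_gib: float = 100.0) -> dict:
    """Summarise free space and current contents of a spill directory."""
    exists = os.path.isdir(path)

    # disk_usage needs an existing path, so fall back to the closest
    # existing parent when the spill directory has not been created yet.
    probe = os.path.abspath(path)
    while not os.path.exists(probe):
        probe = os.path.dirname(probe)
    free_gib = shutil.disk_usage(probe).free / 2**30

    spilled_bytes = 0
    if exists:
        for root, _dirs, files in os.walk(path):
            for name in files:
                try:
                    spilled_bytes += os.path.getsize(os.path.join(root, name))
                except OSError:
                    pass  # file was removed while scanning

    return {
        "exists": exists,
        "free_gib": free_gib,
        "spilled_bytes": spilled_bytes,
        "enough_space": free_gib >= min_free_gib,
    }


print(spill_dir_report("/tmp/spill"))
```

If spilled_bytes stays at zero on every node while the dashboard shows the object store filling up, that suggests the spilling configuration never took effect, which points back at the quoting of the start command.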