Remote Ray cluster not spilling to disk

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi all,
We have been experimenting with large datasets using Modin on Ray (> 50–80 GB CSVs with 20+ columns), and we haven’t had any issues on a vertically scaled single machine, thanks to its ability to spill to disk by default.

I’ve been looking into horizontally scaling with a remote Ray cluster to handle the same tasks, but I regularly run into the error "<> Workers killed due to memory pressure (OOM)".
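To give a concrete (simplified) picture, the workload is along these lines — the file path, column names, and the exact transformation below are just placeholders, but I do connect through Ray Client, which is where the warning further down comes from:

import ray
import modin.pandas as pd

# Connect to the remote cluster through Ray Client (address is a placeholder).
ray.init(address="ray://<head-node-ip>:10001")

# Load one of the large CSVs and run a typical transformation
# (path, column names, and aggregation are illustrative).
df = pd.read_csv("s3://<bucket>/large_dataset.csv")
result = df.groupby("key_column").agg({"value_column": "sum"})
print(result.head())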

Some more error logs

UserWarning: Ray Client is attempting to retrieve a 6.37 GiB object over the network, which may be slow. Consider serializing the object to a file and using S3 or rsync instead.
(raylet) [2024-12-23 02:51:28,780 E 433 433] (raylet) node_manager.cc:3069: 3 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: bc50b1b54c80c8a1a049327b661c2a1016d3aac156ca01c7b0a12d3e, IP: <ip>) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip <ip>`
(raylet) 
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
[2024-12-23 10:52:06 +0000] [41] [ERROR] Worker (pid:42) was sent SIGKILL! Perhaps out of memory?

For some reference, I am able to load the dataset just fine, but any further transformations on it error out as above, without any errors mentioning object spilling, even though I have configured spilling as described in the documentation at Object Spilling — Ray 2.40.0. I’m fairly new to Ray clusters, so I just followed the documentation.
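For reference, my understanding from that doc is that the single-process equivalent of the spilling config in my YAML below would look roughly like this (the directory path just mirrors what I put in the cluster config):

import json
import ray

# Same object_spilling_config as in the cluster YAML below,
# passed directly to ray.init for a local, single-node test.
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/tmp/spill"}}
        )
    }
)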

Below is the output of my `ray status` for reference, showing the available resources from my configuration:

---------------------------------------------------------------
Usage:
 0.0/88.0 CPU
 0B/474.75GiB memory
 0B/206.13GiB object_store_memory

Demands:
 (no resource demands)

Below are the relevant details from my cluster YAML file:

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/tmp/spill\"}}"}'


# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Let me know if I’m missing something, thanks in advance!

Thanks for raising the issue, @Lichtz!

Workers being killed due to OOM can happen for many different reasons, and it can happen even when object spilling is configured. You will probably need to figure out which situation you are in, and then work out why the same thing doesn’t happen in the single-node case.

To provide more context: Ray has a memory monitor that checks node memory usage to prevent the Linux OOM killer from killing Ray processes. Object spilling, on the other hand, is driven by the object_store_memory limit and only applies to large objects that live in the shared object store. Small objects, and memory allocated on a worker’s own heap, never go through the object store, so if a workload produces a lot of them, the memory monitor can still be triggered and kill the running worker processes.
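Here is a minimal sketch of that distinction (the sizes and task bodies are just for illustration): the first task’s memory lives on the worker’s heap and is only ever handled by the memory monitor, while the second task’s large return value goes into the object store, which is the memory that object spilling (configured as in your YAML) can actually move to /tmp/spill.

import numpy as np
import ray

ray.init()

@ray.remote
def heap_heavy_task():
    # This array lives in the worker process's heap, not the object store.
    # It can never be spilled; if many such tasks run concurrently, the
    # memory monitor may kill workers to protect the node.
    arr = np.ones(10**9, dtype=np.uint8)  # ~1 GiB of worker heap
    return int(arr.sum())

@ray.remote
def object_store_task():
    # A large return value (like a big Modin partition) is stored in the
    # object store, which is the only memory that object spilling manages.
    return np.ones(10**9, dtype=np.uint8)

print(ray.get(heap_heavy_task.remote()))
big_ref = object_store_task.remote()  # may be spilled once the store fills up

If worker heap usage turns out to be the problem, the knobs the raylet message already points at are the relevant ones: reduce task parallelism (e.g., request more CPUs per task), provision more memory, or adjust `RAY_memory_usage_threshold`. Spilling alone will not help there, because heap memory is never spilled.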

To further troubleshoot the memory issue, you can follow the doc here: Debugging Memory Issues — Ray 2.40.0 to get more information about memory usage. And to monitor object spilling activity, you can follow the steps here: Object Spilling — Ray 2.40.0.
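As a quick sanity check on top of those docs, you can also watch the spill directory you configured grow while the job runs; a rough sketch, assuming the /tmp/spill path from your YAML:

import os

spill_dir = "/tmp/spill"  # the directory_path from your object_spilling_config

# Sum up everything currently written under the spill directory.
total_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, files in os.walk(spill_dir)
    for name in files
)
print(f"{total_bytes / 1024 ** 3:.2f} GiB currently spilled under {spill_dir}")

If that number stays at zero while the raylet keeps reporting OOM kills, the object store is likely not the part of memory that is filling up.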

Let us know if you collect more information or have further questions.