How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.
Hi all,
We have been experimenting with large datasets using Modin on Ray (> 50-80 GB CSVs of 20+ columns) and we haven’t had any issues with it on a vertically scaled single machine due to its capability to spill to disk by default.
I’ve been looking into horizontally scaling a remote ray cluster to handle the same tasks but on a regular basis, I’m seeing the error " <> workers killed due to memory pressure (OOM)"
Some more error logs
UserWarning: Ray Client is attempting to retrieve a 6.37 GiB object over the network, which may be slow. Consider serializing the object to a file and using S3 or rsync instead.
(raylet) [2024-12-23 02:51:28,780 E 433 433] (raylet) node_manager.cc:3069: 3 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: bc50b1b54c80c8a1a049327b661c2a1016d3aac156ca01c7b0a12d3e, IP: <ip>) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip <ip>`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
[2024-12-23 10:52:06 +0000] [41] [ERROR] Worker (pid:42) was sent SIGKILL! Perhaps out of memory?
For some reference, I am able to load the dataset just fine, but any further transformations on the dataset seem to error out like above without any errors on object spilling even though I have added it as mentioned in the documentation at Object Spilling — Ray 2.40.0 I’m fairly new to ray clusters so just followed the documentation.
Below is the output of my ray status for reference to show available resources from my configuration
---------------------------------------------------------------
Usage:
0.0/88.0 CPU
0B/474.75GiB memory
0B/206.13GiB object_store_memory
Demands:
(no resource demands)
Below are details from my yaml file
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
- ray stop
- >-
ray start
--head
--port=6379
--object-manager-port=8076
--autoscaling-config=~/ray_bootstrap_config.yaml
--system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/tmp/spill\"}}"}'
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- >-
ray start
--address=$RAY_HEAD_IP:6379
--object-manager-port=8076
Let me know if I’m missing something, thanks in advance!