How severely does this issue affect your experience of using Ray?
Medium to High: not exactly sure I can find a scalable workaround
I’m trying to run a ray[tune] session (not sure this is important, but it’s driven from Microsoft’s flaml library, which uses an older API) for a custom optimization objective. I’m running it on a cluster of AWS machines: three nodes, each with 64 cores and 128 GB of memory. I’m assigning one CPU core per task, so I’m running 64 tasks in parallel on each virtual machine.
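For reference, the resource request looks roughly like this. This is a simplified sketch, not my actual code: flaml builds the tune.run call internally, and my_objective, evaluate_configuration and the config space here are just stand-ins for my real objective.

import ray
from ray import tune

def my_objective(config):
    # stand-in for my real black-box objective; evaluates one configuration
    score = evaluate_configuration(config)  # hypothetical helper
    tune.report(score=score)

ray.init(address="auto")  # attach to the running AWS cluster
tune.run(
    my_objective,
    config={"x": tune.uniform(0.0, 1.0)},  # stand-in search space
    num_samples=1000,
    resources_per_trial={"cpu": 1},        # 1 CPU per trial -> 64 concurrent trials per node
)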
I’m not 100% sure my app works right (i.e., that it doesn’t have any memory leaks), but normally its memory footprint shouldn’t depend on the size of the data it’s running on (at least not more than marginally).
However, after recently running it on some new data (which is a bit bigger than before, though, as I mentioned, that isn’t supposed to matter much), the optimization keeps crashing. The first crashes produced more cryptic output (something along the lines of “this could be OOM, SIGSEGV, or another unspecified error”), but now I get clearer entries in the console log, in which I can watch the node’s memory consumption gradually increase until the crash. Before, the increase in memory was perhaps more abrupt, and it crashed without logging anything.
Here’s the console printout:
(raylet) [2023-03-26 04:01:09,343 E 7088 7088] (raylet) node_manager.cc:3040: 1 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 75b5c6436c2fe3ad476a5fd2f352e6c59f94e4f7e4e00c849d704c3c, IP: 172.31.13.0) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 172.31.13.0`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
2023-03-26 04:01:43,353 ERROR trial_runner.py:1551 -- Trial evaluate_config_b7162900: Error stopping trial.
Traceback (most recent call last):
File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 1544, in stop_trial
self._callbacks.on_trial_complete(
File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/callback.py", line 360, in on_trial_complete
callback.on_trial_complete(**info)
File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/syncer.py", line 731, in on_trial_complete
self._sync_trial_dir(trial, force=True, wait=True)
File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/syncer.py", line 703, in _sync_trial_dir
sync_process.wait()
File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/syncer.py", line 237, in wait
raise exception
File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/syncer.py", line 200, in entrypoint
result = self._fn(*args, **kwargs)
File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/utils/file_transfer.py", line 69, in sync_dir_between_nodes
return _sync_dir_between_different_nodes(
File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/utils/file_transfer.py", line 197, in _sync_dir_between_different_nodes
return ray.get(unpack_future)
File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/_private/worker.py", line 2382, in get
raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 172.31.13.0, ID: 75b5c6436c2fe3ad476a5fd2f352e6c59f94e4f7e4e00c849d704c3c) where the task (task ID: 505a966cce7b56cf0860564245b9b628091e031201000000, name=_unpack_from_actor, pid=53880, memory used=0.08GB) was running was 117.40GB / 123.55GB (0.950225), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 4e15dc6b6c58469c4ee06dcd8bf91b3488e3e53d37ca80e1fb1e3eeb) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.31.13.0`. To see the logs of the worker, use `ray logs worker-4e15dc6b6c58469c4ee06dcd8bf91b3488e3e53d37ca80e1fb1e3eeb*out -ip 172.31.13.0. Top 10 memory users:
PID MEM(GB) COMMAND
7088 83.29 /home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_n...
7468 9.00 python3 -m black_box_optimization.using_flaml -q /home/ec2-user/data -c /...
6677 1.60 /home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/r...
7994 0.38 ray::ImplicitFunc
9009 0.35 ray::ImplicitFunc
8682 0.34 ray::ImplicitFunc
8021 0.33 ray::ImplicitFunc
9487 0.33 ray::ImplicitFunc
8024 0.33 ray::ImplicitFunc
7993 0.33 ray::ImplicitFunc
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
Note: the console printout above is identical to the contents of the error.txt file of the failing trial.
Now, the part I’m most confused about is the memory hogs table at the end. OK, the fact that it says my app needs 9 GB of memory to run raises some questions and suggests there might be an error in my code as well. However, what’s going on with the 83.29 GB used by the raylet?
As far as I can see in the docs, the raylet’s memory usage should be “typically quite small”. Does this figure include the memory used by the Object Store? I mean, I don’t see any spilling reported, and I barely use the Object Store consciously (I understand task args and results are passed through it, but to my knowledge those are tiny for the tasks I’m running).
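For what it’s worth, I didn’t capture the object store occupancy while the job was running; next time I’ll try to snapshot it with the ray memory CLI, or with something like this from a driver attached to the cluster (just a sketch of what I have in mind):

import ray

ray.init(address="auto")        # attach to the running cluster
# Totals Ray reserved across the cluster, including object_store_memory (bytes):
print(ray.cluster_resources())
# What is currently unclaimed by tasks/actors (a logical view, not physical RSS):
print(ray.available_resources())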
Now, I can’t run
ray logs worker-4e15dc6b6c58469c4ee06dcd8bf91b3488e3e53d37ca80e1fb1e3eeb*out -ip 172.31.13.0
as suggested in the error message, since the AWS machines have been shut down since the error occurred. However, I was able to save the contents of /tmp/ray/* from each node. I’ve sifted a bit through the worker-specific log files (python-core-worker-4e15dc6b6c58469c4ee06dcd8bf91b3488e3e53d37ca80e1fb1e3eeb_53880.log, worker-4e15dc6b6c58469c4ee06dcd8bf91b3488e3e53d37ca80e1fb1e3eeb-01000000-53880.err, worker-4e15dc6b6c58469c4ee06dcd8bf91b3488e3e53d37ca80e1fb1e3eeb-01000000-53880.out in /tmp/ray/session_latest/logs/), but wasn’t able to find anything of help.
I can see there’s some occupancy data in monitor.log (which, as far as I can tell, is the same stuff I get if I run ray monitor cluster.yaml):
2023-03-26 04:01:16,843 INFO autoscaler.py:419 --
======== Autoscaler status: 2023-03-26 04:01:16.840452 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray.head.default
2 ray.worker.default
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
191.0/192.0 CPU (192.0 used of 193.0 reserved in placement groups)
0.00/255.885 GiB memory
0.00/110.779 GiB object_store_memory
Demands:
{'CPU': 1.0} * 1 (PACK): 1+ pending placement groups
2023-03-26 04:01:16,851 INFO autoscaler.py:462 -- The autoscaler took 0.067 seconds to complete the update iteration.
2023-03-26 04:01:21,944 INFO autoscaler.py:143 -- The autoscaler took 0.056 seconds to fetch the list of non-terminated nodes.
2023-03-26 04:01:21,949 INFO autoscaler.py:419 --
======== Autoscaler status: 2023-03-26 04:01:21.946023 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray.head.default
2 ray.worker.default
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
191.0/192.0 CPU (192.0 used of 193.0 reserved in placement groups)
0.00/255.885 GiB memory
0.00/110.779 GiB object_store_memory
Demands:
{'CPU': 1.0} * 1 (PACK): 1+ pending placement groups
2023-03-26 04:01:21,957 INFO autoscaler.py:462 -- The autoscaler took 0.069 seconds to complete the update iteration.
2023-03-26 04:01:27,264 INFO autoscaler.py:143 -- The autoscaler took 0.27 seconds to fetch the list of non-terminated nodes.
Here I don’t really understand the memory usage stats. I.e., are the 110.779 GiB of object_store_memory reserved but not used? Can I use that memory for something else, e.g., for running my code instead?
This is one of the last healthy updates; after the OOM, the number of CPUs in use drops and then remains at zero.
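Coming back to the object store question: what I mean by “using that memory for something else” is, roughly, either capping the object store when the nodes start (I believe ray start accepts an --object-store-memory option, in bytes, which would go into the start commands in cluster.yaml), or making each trial reserve part of the logical memory resource so that fewer trials get packed onto a node. A sketch of the latter (the 2 GiB figure is made up):

from ray import tune

def my_objective(config):       # same stand-in objective as in the sketch above
    tune.report(score=0.0)

tune.run(
    my_objective,
    resources_per_trial={
        "cpu": 1,
        "memory": 2 * 1024**3,  # reserve ~2 GiB of the node's logical memory resource per trial,
    },                          # so fewer trials run concurrently on each node
)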
Also, I looked in raylet.out, and I can’t say I understood much of it, but I found some entries that look like this (which I also find weird):
[state-dump] Local id: 8515322581423151276 Local resources: {memory: [823395780610000]/[823395780610000], node:172.31.13.0: [10000]/[10000], CPU: [640000]/[640000], object_store_memory: [395741048830000]/[395741048830000]}node id: 2072714893315002763{memory: 962072674300000/962072674300000, node:172.31.15.79: 10000/10000, CPU: 640000/640000, object_store_memory: 396871999480000/396871999480000}node id: 8515322581423151276{CPU: 640000/640000, object_store_memory: 395741048830000/395741048830000, node:172.31.13.0: 10000/10000, memory: 823395780610000/823395780610000}node id: -7427968395146788597{object_store_memory: 396868755450000/396868755450000, bundle_group_0_7e36e8882e89945960e107bb436201000000: 10000000/10000000, node:172.31.4.21: 10000/10000, bundle_group_7e36e8882e89945960e107bb436201000000: 10000000/10000000, memory: 962072674300000/962072674300000, CPU_group_7e36e8882e89945960e107bb436201000000: 10000/10000, CPU_group_0_7e36e8882e89945960e107bb436201000000: 10000/10000, CPU: 640000/640000}{ "placment group locations": [], "node to bundles": []}
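If I’m reading that dump right (this is my assumption, not something I found in the docs), the huge integers are just a fixed-point representation of the resources, i.e. the real values multiplied by 10^4, and they do line up with the autoscaler numbers above:

# Assumption: state-dump resource values are fixed-point, scaled by 10**4
print(640000 / 10**4)            # 64.0 -> the 64 CPUs of one node
print(823395780610000 / 10**4)   # ~8.23e10 bytes, i.e. ~76.7 GiB "memory" resource on this node
print(395741048830000 / 10**4)   # ~3.96e10 bytes, i.e. ~36.9 GiB object_store_memory on this node
# Summing the three nodes gives ~255.9 GiB memory and ~110.8 GiB object_store_memory,
# which matches the autoscaler totals quoted earlier.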
Also, is this condition connected to this other issue?
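In the meantime, based on the suggestions in the OOM message, I’m considering relaxing the memory monitor. My understanding (please correct me if I’m wrong) is that these are environment variables read when the raylet starts, so on a cluster they belong in the ray start commands in cluster.yaml; for a quick single-node test, something like this should be equivalent:

import os

# Must be set before Ray starts; on a cluster, export these in the head/worker
# `ray start` commands rather than in the driver script.
os.environ["RAY_memory_usage_threshold"] = "0.9"     # lower the kill threshold from the default 0.95
# os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # or disable worker killing entirely

import ray
ray.init()  # local test only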
So, what can I do to get this running, other than fixing my own code :)?