Weird error logs when running out of memory (OOM)

How severely does this issue affect your experience of using Ray?
Medium to High: I'm not sure I can find a scalable workaround.

I’m trying to run a ray[tune] session for a custom optimization objective (not sure this is important, but it’s driven from Microsoft’s flaml library, which uses an older API). I’m running it on a cluster of AWS machines: three nodes, each with 64 cores and 128 GB of memory. I’m assigning one CPU core per task, so I’m running 64 tasks in parallel on each virtual machine.
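For context, the driving code is shaped roughly like the sketch below. This is not the production code – the objective, the search space and the budget are made-up placeholders, and I’m glossing over most of the flaml wiring:

import ray
from flaml import tune

ray.init(address="auto")  # connect to the existing AWS cluster

def evaluate_config(config):
    # placeholder for the real CPU-only objective, which is quick to evaluate
    score = (config["x"] - 0.5) ** 2 + config["y"]
    return {"score": score}

analysis = tune.run(
    evaluate_config,
    config={
        "x": tune.uniform(0.0, 1.0),
        "y": tune.randint(0, 10),
    },
    metric="score",
    mode="min",
    num_samples=1000,                # placeholder; the real runs do tens of thousands of trials
    use_ray=True,                    # run the trials through Ray Tune
    resources_per_trial={"cpu": 1},  # one CPU core per trial -> 64 concurrent trials per node
)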

I’m not 100% sure my app works right (i.e., that it doesn’t have any memory leaks), but normally its memory footprint shouldn’t depend on the size of the data it’s running on (at least not more than marginally).

However, after recently running it on some new data (a bit bigger than before, but as I mentioned, that isn’t supposed to matter much), the optimization keeps crashing. The first crashes produced more cryptic output (along the lines of “this could be OOM, SIGSEGV, or some other unspecified error”), but now I get clearer entries in the console log, in which I can see the node’s memory consumption gradually increase until it crashes. Before, the increase in memory was perhaps more abrupt, and it crashed without logging anything.

Here’s the console printout:

(raylet) [2023-03-26 04:01:09,343 E 7088 7088] (raylet) node_manager.cc:3040: 1 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 75b5c6436c2fe3ad476a5fd2f352e6c59f94e4f7e4e00c849d704c3c, IP: 172.31.13.0) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 172.31.13.0`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
2023-03-26 04:01:43,353 ERROR trial_runner.py:1551 -- Trial evaluate_config_b7162900: Error stopping trial.
Traceback (most recent call last):
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/execution/trial_runner.py", line 1544, in stop_trial
    self._callbacks.on_trial_complete(
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/callback.py", line 360, in on_trial_complete
    callback.on_trial_complete(**info)
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/syncer.py", line 731, in on_trial_complete
    self._sync_trial_dir(trial, force=True, wait=True)
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/syncer.py", line 703, in _sync_trial_dir
    sync_process.wait()
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/syncer.py", line 237, in wait
    raise exception
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/syncer.py", line 200, in entrypoint
    result = self._fn(*args, **kwargs)
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/utils/file_transfer.py", line 69, in sync_dir_between_nodes
    return _sync_dir_between_different_nodes(
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/tune/utils/file_transfer.py", line 197, in _sync_dir_between_different_nodes
    return ray.get(unpack_future)
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/_private/worker.py", line 2382, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 172.31.13.0, ID: 75b5c6436c2fe3ad476a5fd2f352e6c59f94e4f7e4e00c849d704c3c) where the task (task ID: 505a966cce7b56cf0860564245b9b628091e031201000000, name=_unpack_from_actor, pid=53880, memory used=0.08GB) was running was 117.40GB / 123.55GB (0.950225), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 4e15dc6b6c58469c4ee06dcd8bf91b3488e3e53d37ca80e1fb1e3eeb) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.31.13.0`. To see the logs of the worker, use `ray logs worker-4e15dc6b6c58469c4ee06dcd8bf91b3488e3e53d37ca80e1fb1e3eeb*out -ip 172.31.13.0. Top 10 memory users:
PID     MEM(GB) COMMAND
7088    83.29   /home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_n...
7468    9.00    python3 -m black_box_optimization.using_flaml -q /home/ec2-user/data -c /...
6677    1.60    /home/ec2-user/pyvenv/lib64/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/r...
7994    0.38    ray::ImplicitFunc
9009    0.35    ray::ImplicitFunc
8682    0.34    ray::ImplicitFunc
8021    0.33    ray::ImplicitFunc
9487    0.33    ray::ImplicitFunc
8024    0.33    ray::ImplicitFunc
7993    0.33    ray::ImplicitFunc
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

Note: the console printout above is identical to the contents of the error.txt file of the failing trial.

Now, the part I’m most confused about is the memory-hogs table at the end. OK, the fact that it says my app requires 9 GB of memory to run raises some questions and suggests there might be an error in my code as well. But what’s going on with the 83.29 GB used by the raylet?

As far as I can see in the docs, the raylet’s memory should be “typically quite small”. Does this figure include the memory used by the Object Store? I don’t see any spilling reported, and I hardly use the object store explicitly at all (I understand args and results are passed through it, but to my knowledge, those are tiny for the tasks I’m running).

Now, I can’t run

ray logs worker-4e15dc6b6c58469c4ee06dcd8bf91b3488e3e53d37ca80e1fb1e3eeb*out -ip 172.31.13.0

as suggested in the error message, because the AWS machines have since been shut down. However, I was able to save the contents of /tmp/ray/* from each node. I’ve sifted through the worker-specific log files (python-core-worker-4e15dc6b6c58469c4ee06dcd8bf91b3488e3e53d37ca80e1fb1e3eeb_53880.log, worker-4e15dc6b6c58469c4ee06dcd8bf91b3488e3e53d37ca80e1fb1e3eeb-01000000-53880.err, worker-4e15dc6b6c58469c4ee06dcd8bf91b3488e3e53d37ca80e1fb1e3eeb-01000000-53880.out in /tmp/ray/session_latest/logs/), but wasn’t able to find anything helpful.

I can see there’s some occupancy data in monitor.log (which, as far as I can tell, is the same output I get from ray monitor cluster.yaml):

2023-03-26 04:01:16,843	INFO autoscaler.py:419 --
======== Autoscaler status: 2023-03-26 04:01:16.840452 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.default
 2 ray.worker.default
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 191.0/192.0 CPU (192.0 used of 193.0 reserved in placement groups)
 0.00/255.885 GiB memory
 0.00/110.779 GiB object_store_memory

Demands:
 {'CPU': 1.0} * 1 (PACK): 1+ pending placement groups
2023-03-26 04:01:16,851	INFO autoscaler.py:462 -- The autoscaler took 0.067 seconds to complete the update iteration.
2023-03-26 04:01:21,944	INFO autoscaler.py:143 -- The autoscaler took 0.056 seconds to fetch the list of non-terminated nodes.
2023-03-26 04:01:21,949	INFO autoscaler.py:419 --
======== Autoscaler status: 2023-03-26 04:01:21.946023 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.default
 2 ray.worker.default
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 191.0/192.0 CPU (192.0 used of 193.0 reserved in placement groups)
 0.00/255.885 GiB memory
 0.00/110.779 GiB object_store_memory

Demands:
 {'CPU': 1.0} * 1 (PACK): 1+ pending placement groups
2023-03-26 04:01:21,957	INFO autoscaler.py:462 -- The autoscaler took 0.069 seconds to complete the update iteration.
2023-03-26 04:01:27,264	INFO autoscaler.py:143 -- The autoscaler took 0.27 seconds to fetch the list of non-terminated nodes.

Here I don’t really understand the memory usage stats. I.e., are 110.779 GiB of object_store_memory reserved but not used? Can I use that memory for something else, e.g., for running my code instead?

This is one of the latest healthy updates. After the OOM, the number of CPUs in use drops and then remains at zero.

Also, I looked in raylet.out and I can’t say I understood much of it, but I found some entries that look like this (which also look weird to me):

[state-dump] Local id: 8515322581423151276 Local resources: {memory: [823395780610000]/[823395780610000], node:172.31.13.0: [10000]/[10000], CPU: [640000]/[640000], object_store_memory: [395741048830000]/[395741048830000]}node id: 2072714893315002763{memory: 962072674300000/962072674300000, node:172.31.15.79: 10000/10000, CPU: 640000/640000, object_store_memory: 396871999480000/396871999480000}node id: 8515322581423151276{CPU: 640000/640000, object_store_memory: 395741048830000/395741048830000, node:172.31.13.0: 10000/10000, memory: 823395780610000/823395780610000}node id: -7427968395146788597{object_store_memory: 396868755450000/396868755450000, bundle_group_0_7e36e8882e89945960e107bb436201000000: 10000000/10000000, node:172.31.4.21: 10000/10000, bundle_group_7e36e8882e89945960e107bb436201000000: 10000000/10000000, memory: 962072674300000/962072674300000, CPU_group_7e36e8882e89945960e107bb436201000000: 10000/10000, CPU_group_0_7e36e8882e89945960e107bb436201000000: 10000/10000, CPU: 640000/640000}{ "placment group locations": [], "node to bundles": []}

Also, is this condition connected to this other issue?

So, what can I do to get this running, other than fixing my own code :)?

0.00/110.779 GiB object_store_memory – are you using the object store? You could try running Ray with less of your cluster’s memory allocated for object transfer (Cluster Management CLI — Ray 2.3.0).

83 GB used by the raylet seems like a lot to me. It’s hard to say from that number alone, though – seeing RSS vs. shared memory in something like ps would be ideal. Are you able to poll the resident set size of the raylet?
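If it helps, something along these lines (a rough, untested sketch using psutil, run directly on the node) should be enough to poll it:

import time
import psutil

def raylet_memory():
    # find the raylet process by its command line and return (rss, shared) in bytes
    for proc in psutil.process_iter(["cmdline"]):
        try:
            cmdline = proc.info["cmdline"] or []
            if any("raylet/raylet" in part for part in cmdline):
                mem = proc.memory_info()
                # on Linux, `shared` would include memory mapped from /dev/shm,
                # which is presumably where the object store shows up
                return mem.rss, getattr(mem, "shared", 0)
        except psutil.Error:
            continue
    return None

while True:
    info = raylet_memory()
    if info:
        rss, shared = info
        print(f"raylet rss={rss / 1e9:.2f} GB shared={shared / 1e9:.2f} GB")
    time.sleep(60)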

Ray will kill tasks that consume too much memory so that the overall job can still complete; this is done by a component called the OOM killer. It’s a best-effort heuristic and isn’t guaranteed to prevent thrashing. Since I don’t understand your workload very well (or what memory usage you expect), I can’t say whether tweaking the OOM killer configuration will help you. But maybe reading this can give you more insight into the problem. Out-Of-Memory Prevention — Ray 2.3.0
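If you do decide to experiment with it, the knobs are the two environment variables mentioned in your error message. They have to be visible to the raylet, so on a multi-node cluster they belong in the environment of the ray start commands on every node; for a quick single-node experiment, a sketch like this should be enough, since the locally started raylet inherits the driver’s environment (the values below are only for illustration):

import os

os.environ["RAY_memory_usage_threshold"] = "0.9"     # kill workers once node memory exceeds 90%
os.environ["RAY_memory_monitor_refresh_ms"] = "250"  # how often the monitor checks; 0 disables it

import ray
ray.init()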

Probably worth sharing code as well, might answer a lot of other questions :slight_smile:

are you using the object store?

No. As I said (though perhaps I didn’t convey my thoughts very coherently before – sorry about that), I know it’s used to pass parameters to and retrieve return values from the worker nodes, but other than that, no. Does the 0.00 in 0.00/110.779 GiB prove that to be the case?

Anyway, how does this work? Is this memory reserved exclusively for the object store? That is, will it prevent my app from using that memory even if the object store needs far less? And how did it get to 111 GiB? According to the docs you linked above, it’s capped at 20 GB, so for 3 nodes shouldn’t it amount to at most 60 GiB (and that only if the cap is 20 GB per worker node, not overall)?

But maybe reading this can give you more insight into the problem. Out-Of-Memory Prevention — Ray 2.3.0

I had read this before posting my question, and I’ve now read it again. I don’t want to disable the memory monitor, and I don’t want to retry failed trials that resulted in OOM, since they will hit OOM again and the node will crash anyway.

I’m currently trying to identify whether or not my own code bloats anywhere, but I also noticed the very high memory requirements of the raylet, and I can imagine either or both of these factors might lead to OOM. That’s why I’m, basically, asking whether there’s a memory leak in the raylet when there are many short tasks to run.

Probably worth sharing code as well, might answer a lot of other questions :slight_smile:

I can’t share the production code because I don’t own it, but I can tell you, as above, that the function to optimize is CPU-only, and is rather quick to evaluate (it’s not, e.g., a neural net training that takes many hours or days), so the pattern is trying lots of parameter configurations quickly. To give an indication of the order of magnitude, 34k trials are run on 192 cores in 39 hours. Some of them are early stopped by ASHA.

  • Yes, the memory is reserved exclusively for the object store. In systems parlance, the memory is page-locked and mapped into the Python processes to enable efficient object copy and retrieval. This setup is ideal for large-scale data-processing workloads; for workloads where each worker doesn’t need to communicate large amounts of data with other workers, one can use less object store memory than the default.
  • How did it get to 111 GiB? I am not sure what configuration you’re using (one can override the 20GB/node cap). I recommend setting the object store size to something that fits your workload via Cluster Management CLI — Ray 2.3.0 (--object-store-memory); a minimal sketch is below.
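As a sketch (the 8 GiB value is only an illustration, not a recommendation): for a single machine started from Python it would look like the snippet below; on a cluster launched from a YAML config, the equivalent is passing --object-store-memory to the ray start commands for the head and worker nodes.

import ray

# cap the object store explicitly instead of relying on the default sizing;
# the value is in bytes (8 GiB here is just an example)
ray.init(object_store_memory=8 * 1024**3)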

I’m currently trying to identify whether or not my own code bloats anywhere, but I also noticed the very high memory requirements of the raylet, and I can imagine either or both of these factors might lead to OOM. That’s why I’m, basically, asking whether there’s a memory leak in the raylet when there are many short tasks to run.

  • To answer your question directly, we don’t know of any memory leak like this in the raylet. We run a weekly test that creates 10k tasks over 4096 CPUs. If you observe monotonically increasing RSS in any Ray process, we’d of course love to fix it.

I can’t share the production code because I don’t own it, but I can tell you, as above, that the function to optimize is CPU-only, and is rather quick to evaluate (it’s not, e.g., a neural net training that takes many hours or days), so the pattern is trying lots of parameter configurations quickly. To give an indication of the order of magnitude, 34k trials are run on 192 cores in 39 hours. Some of them are early stopped by ASHA.

  • Hard to say without code, but it’s definitely possible for memory leaks to happen. To know for sure, we’d need some repro that you could share – even a stripped-down skeleton along the lines of the sketch below would help.
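Something in this spirit (a hypothetical skeleton, not your workload): lots of short, CPU-only tasks with nothing big going through the object store, while node and raylet memory are watched over time.

import time
import ray

ray.init()

@ray.remote(num_cpus=1)
def short_task(i):
    # stand-in for a quick, CPU-only objective evaluation
    return sum(j * j for j in range(10_000)) + i

for round_idx in range(1_000):
    # 64 tasks at a time, mirroring one-task-per-core on a 64-core node
    results = ray.get([short_task.remote(i) for i in range(64)])
    if round_idx % 50 == 0:
        print(f"finished round {round_idx}, last batch size {len(results)}")
        # node/raylet memory could be sampled here (e.g. with psutil, as above)
    time.sleep(0.1)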

How did it get to 111 GiB? I am not sure what configuration you’re using (one can override the 20GB/node cap). I recommend setting the object store size to something that fits your workload via Cluster Management CLI — Ray 2.3.0 (--object-store-memory).

I got an answer for the 111 GB. According to these docs, when using YAML configs for AWS, the object store defaults to 30% of the available memory.
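That roughly matches the numbers above, if the 30% is taken per node and summed over the three nodes (ignoring GB vs. GiB rounding):

per_node_memory = 123.55     # "117.40GB / 123.55GB" in the OOM message above
object_store_fraction = 0.30
nodes = 3

print(nodes * per_node_memory * object_store_fraction)  # ~111, close to the 110.779 GiB reported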


Hi again and sorry for the long pause,

In the meantime I rewrote the places in my code that might have caused a memory leak, tried more aggressively to convince the garbage collector to do its job, and reserved less memory for the object store. After all these attempts at a workaround, I still believe there’s a memory leak somewhere. I can’t swear it’s not in my own code (I haven’t done proper memory profiling just yet), although there’s some evidence that my code is not the problem: I used a regular multiprocessing.Pool to run a comparable number of tasks, and it didn’t run out of memory the way it does with Ray.
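For the profiling part, the plan is something along these lines: a tracemalloc snapshot diff around repeated calls of the objective, just to see whether Python-level allocations grow across evaluations (evaluate_config and sample_config are placeholders for the real objective and config sampler):

import gc
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

for i in range(1, 101):
    evaluate_config(sample_config())  # placeholders for the real objective / config sampler
    if i % 20 == 0:
        gc.collect()
        snapshot = tracemalloc.take_snapshot()
        print(f"--- after {i} evaluations ---")
        for stat in snapshot.compare_to(baseline, "lineno")[:5]:
            print(stat)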

To answer your question directly, we don’t know of any memory leak like this in the raylet. We run a weekly test that creates 10k tasks over 4096 CPUs. If you observe monotonically increasing RSS in any Ray process, we’d of course love to fix it.

Looking at Ray’s status messages in the console, the reported memory usage slowly increases from 20 to 60 GiB (not sure whether this is RSS or not) over about 20 hours of running on 192 CPUs (one head and 2 worker nodes in an AWS cluster), during which it finished almost 30k trials.

First status:

== Status ==
Current time: 2023-04-07 16:59:15 (running for 00:00:02.06)
Memory usage on this node: 2.5/123.6 GiB 
Using AsyncHyperBand: num_stopped=0

Then, soon enough, it stabilizes at something like this:

== Status ==
Current time: 2023-04-07 17:02:04 (running for 00:02:51.52)
Memory usage on this node: 18.0/123.6 GiB 
Using AsyncHyperBand: num_stopped=135

And then it increases slowly but rather steadily (there is a bit of oscillation noise, but the trend is obvious), so this ends up being the last status:

== Status ==
Current time: 2023-04-08 13:17:21 (running for 20:18:07.68)
Memory usage on this node: 59.3/123.6 GiB 
Using AsyncHyperBand: num_stopped=29361

Also, the output of ray monitor aws.yaml still looks similar to before:

Resources
---------------------------------------------------------------
Usage:
 191.0/192.0 CPU (192.0 used of 193.0 reserved in placement groups)
 0.00/255.885 GiB memory
 0.00/110.779 GiB object_store_memory

(although now, instead of 111 GiB, the reserved amount is much smaller – something like 12 GiB, presumably because I reserved less for the object store), but it still reports 0 GiB of memory used, and I’m not sure what that’s supposed to mean, either.