Diagnosing unexpected memory management behavior

How severely does this issue affect your experience of using Ray?

  • Medium: It causes significant difficulty in completing my task, but I can work around it.

Hi team, I’m following this issue: [Actor] Possible extra memory consumption · Issue #37291 · ray-project/ray · GitHub. It has not been updated for a while, and since it looks more like a question about Ray usage, I figured the discussion forum is a better place to ask for help.

Regarding the issue itself, I have found a simpler way to reproduce it:

import ray
import numpy as np
import time
import psutil


class Driver:
    def gen(self):
        actors = [Actor1.remote() for i in range(5)]
        data = ray.get([actor.gen.remote() for actor in actors])
        np.sum(data)


@ray.remote
class Actor1:
    def gen(self):
        return np.random.rand(100000000)  # 100M float64 values, ~800 MB per actor


if __name__ == "__main__":
    # tweaks for this repro: effectively disable the memory monitor and
    # flush freed objects without delay
    configs = {
        "memory_monitor_refresh_ms": 0,
        "memory_usage_threshold": 1,
        "free_objects_period_milliseconds": 0,
    }
    ray.init(_system_config=configs)

    driver = Driver()

    while True:
        driver.gen()
        # subprocess.run(["ray", "memory"])  # requires "import subprocess" if uncommented
        print(psutil.Process().memory_info().rss / 1024 / 1024)
        time.sleep(1)

"""
Case 1: Output of this script
$ python ray_37291.py
2023-08-01 23:31:53,972 INFO worker.py:1616 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
3927.51953125
3927.52734375
3927.53515625
...

Case 2: Output if line 11 of the script (the np.sum statement) is deleted
$ sed '11d' ray_37291.py > ray_37291_altered.py && python ray_37291_altered.py
2023-08-01 23:32:37,463 INFO worker.py:1616 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
113.59375
113.73828125
113.73828125
...
"""

Expected Results
In Case 1, RSS should drop back to ~100 MB once driver.gen() completes, because nothing is still holding a reference to those numpy arrays.
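To rule out delayed Python garbage collection on the driver side, the loop can also force a collection before sampling RSS. The snippet below is a minimal sketch of that variant; the explicit gc call is an addition for illustration and is not part of the script above.

import gc

while True:
    driver.gen()
    gc.collect()  # make sure no lingering driver-side Python references keep the arrays alive
    print(psutil.Process().memory_info().rss / 1024 / 1024)
    time.sleep(1)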

It seems like running np.sum on the objects returned from Actor1 pins those objects in the object store, yet ray memory (run by uncommenting the statement in the while loop) reports no object references to them:

======== Object references status: 2023-08-01 23:40:13.832909 ========
Grouping by node address...        Sorting by object size...        Display all entries per group...


To record callsite information for each ObjectRef created, set env variable RAY_record_ref_creation_sites=1

--- Aggregate object store stats across all nodes ---
Plasma memory usage 0 MiB, 0 objects, 0.0% full, 0.0% needed
Objects consumed by Ray tasks: 762 MiB.
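If it helps, callsite information could also be recorded by adapting the __main__ block of the script as sketched below. Driver and configs are as defined in the script above; whether setting the env variable in the driver process is sufficient is an assumption on my part.

import os
import subprocess

if __name__ == "__main__":
    # set before ray.init() so ObjectRef creation callsites are recorded, per the hint in the dump above
    os.environ["RAY_record_ref_creation_sites"] = "1"
    ray.init(_system_config=configs)

    driver = Driver()

    while True:
        driver.gen()
        subprocess.run(["ray", "memory"])  # same call as the commented-out line in the script
        print(psutil.Process().memory_info().rss / 1024 / 1024)
        time.sleep(1)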

Any insights into why we’re seeing this? Many thanks!

Can you try rss - shm and see if the output is still unexpected?

Oftentimes, workers hold onto shm (the handle to the mmapped file in /dev/shm) when they use the object store, and objects allocated this way won’t be GC’ed even after the handle goes out of scope (for optimization reasons). It doesn’t actually use additional memory, because it is mapped onto shared memory.
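A quick way to check this with psutil (a sketch, assuming Linux, where memory_info() exposes a shared field that counts resident shared pages, including the plasma store’s /dev/shm mapping):

# drop-in replacement for the print line in the repro's loop
mem = psutil.Process().memory_info()
print("rss:", mem.rss / 1024 / 1024,
      "shared:", mem.shared / 1024 / 1024,
      "rss - shared:", (mem.rss - mem.shared) / 1024 / 1024)

If rss - shared stays around the Case 2 numbers, the growth is just the shared-memory mapping rather than private memory held by the driver.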