Ray memory leak from arguments to .remote function?

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hello everyone. I asked this question on Slack as well; asking it here in the hope of a faster response.

We’re seeing a memory leak in Ray and we can’t work out where it is coming from. We tried using ray memory to troubleshoot the issue, but without success.
In short, we run a function iter_function over multiple iterations, and it seems that each iteration leaks some memory. Here’s a pseudo-code version of what we do in every iteration:

def iter_function():
    # foo is a @ray.remote function; arg1..arg5 include some large objects
    ref = foo.remote(arg1, arg2, arg3, arg4, arg5)
    a = ray.get(ref)

If we run ray memory at the end of a single iteration, we get the following output:

======== Object references status: 2023-04-11 11:24:29.142980 ========
Grouping by node address...        Sorting by object size...        Display all entries per group...


To record callsite information for each ObjectRef created, set env variable RAY_record_ref_creation_sites=1

--- Aggregate object store stats across all nodes ---
Plasma memory usage 86 MiB, 3 objects, 0.11% full, 0.11% needed
Objects consumed by Ray tasks: 582 MiB.

At that point, if we’re reading this right, we have 3 objects in the store occupying 86 MiB, but we don’t want them there. We’re not sure what these objects are, though we believe they are copies of the large arguments we pass when we call foo.remote. We’re recording the callsite information, but we still can’t see where these objects are created.
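For reference, this is how we enable the callsite recording mentioned in the output (the driver script name here is ours, not part of Ray):

```shell
# Set the env variable before starting the driver so every ObjectRef
# records where it was created; shows up in the ray memory "Call Site" column.
RAY_record_ref_creation_sites=1 python run_driver.py
```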

Suspecting these could be the arguments we pass to the function, we checked the ray memory output right after stepping inside foo.remote. Here’s the output:

======== Object references status: 2023-04-11 12:13:52.263102 ========
Grouping by node address...        Sorting by object size...        Display all entries per group...


--- Summary for node address: 10.68.186.140 ---
Mem Used by Objects  Local References  Pinned        Used by task   Captured in Objects  Actor Handles
181.629489 MB        1, (0.0 MB)       3, (90.814736 MB)  3, (90.814753 MB)  0, (0.0 MB)          0, (0.0 MB)

--- Object references for node address: 10.68.186.140 ---
IP Address       PID    Type    Call Site               Status          Size    Reference Type      Object Ref
10.68.186.140    108803  Driver  (task call)  | /home/f  SCHEDULED       ?       LOCAL_REFERENCE     c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000
                                austiar/SANDS/Projects
                                /KDN/src/FL/moon.py:pa
                                rallel_round_logic:70
                                | /home/faustiar/SANDS
                                /Projects/KDN/src/FL/b
                                ase_fl.py:excute_round
                                :74 | /home/faustiar/S
                                ANDS/Projects/KDN/src/
                                FL/base_fl.py:run:98

10.68.186.140    109059  Worker  (deserialize task arg)  -               0.150283 MB  PINNED_IN_MEMORY    00ffffffffffffffffffffffffffffffffffffff0100000003000000
                                 src.FL.utils.transfer
                                _ray

10.68.186.140    108803  Driver  (task call)  | /home/f  FINISHED        0.150288 MB  USED_BY_PENDING_TASK  00ffffffffffffffffffffffffffffffffffffff0100000003000000
                                austiar/SANDS/Projects
                                /KDN/src/FL/moon.py:pa
                                rallel_round_logic:70
                                | /home/faustiar/SANDS
                                /Projects/KDN/src/FL/b
                                ase_fl.py:excute_round
                                :74 | /home/faustiar/S
                                ANDS/Projects/KDN/src/
                                FL/base_fl.py:run:98

10.68.186.140    109059  Worker  (deserialize task arg)  -               45.332226 MB  PINNED_IN_MEMORY    00ffffffffffffffffffffffffffffffffffffff0100000001000000
                                 src.FL.utils.transfer
                                _ray

10.68.186.140    109059  Worker  (deserialize task arg)  -               45.332227 MB  PINNED_IN_MEMORY    00ffffffffffffffffffffffffffffffffffffff0100000002000000
                                 src.FL.utils.transfer
                                _ray

10.68.186.140    108803  Driver  (task call)  | /home/f  FINISHED        45.332232 MB  USED_BY_PENDING_TASK  00ffffffffffffffffffffffffffffffffffffff0100000001000000
                                austiar/SANDS/Projects
                                /KDN/src/FL/moon.py:pa
                                rallel_round_logic:70
                                | /home/faustiar/SANDS
                                /Projects/KDN/src/FL/b
                                ase_fl.py:excute_round
                                :74 | /home/faustiar/S
                                ANDS/Projects/KDN/src/
                                FL/base_fl.py:run:98

10.68.186.140    108803  Driver  (task call)  | /home/f  FINISHED        45.332233 MB  USED_BY_PENDING_TASK  00ffffffffffffffffffffffffffffffffffffff0100000002000000
                                austiar/SANDS/Projects
                                /KDN/src/FL/moon.py:pa
                                rallel_round_logic:70
                                | /home/faustiar/SANDS
                                /Projects/KDN/src/FL/b
                                ase_fl.py:excute_round
                                :74 | /home/faustiar/S
                                ANDS/Projects/KDN/src/
                                FL/base_fl.py:run:98

To record callsite information for each ObjectRef created, set env variable RAY_record_ref_creation_sites=1

--- Aggregate object store stats across all nodes ---
Plasma memory usage 86 MiB, 3 objects, 0.11% full, 0.11% needed
Objects consumed by Ray tasks: 43 MiB.

The output seems to agree: 3 objects are created when we call foo.remote, and they sum to the same size we see after the iteration completes. We believe these arguments could be causing the memory leak, but we’re not sure how to address it.

Any help would be appreciated.
Thank you.