How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hello everyone. I asked this question on Slack as well; I’m asking it here to see if I can get a faster response.
We’re seeing a memory leak in Ray and we can’t work out where it is coming from. We tried using ray memory to troubleshoot the issue, but without success.
In short, we’re running a function iter_function over multiple iterations, and it seems that each iteration leaks some memory. Here’s a pseudo-code version of what we’re doing in every iteration:
def iter_function():
    ref = foo.remote(arg1, arg2, arg3, arg4, arg5)
    a = ray.get(ref)
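For context, the surrounding driver loop looks roughly like the sketch below. The argument names, shapes, and sizes are placeholders we chose only to roughly match what ray memory reports (two arguments of about 45 MB each plus a few small ones); the real foo does the actual per-round work.

import numpy as np
import ray

ray.init()

# Placeholder task body; the real foo runs the per-round computation.
@ray.remote
def foo(arg1, arg2, arg3, arg4, arg5):
    return float(arg1.sum() + arg2.sum())

# Two large arrays of roughly 45 MB each (sizes are illustrative),
# plus a few small scalar arguments.
arg1 = np.zeros(45_000_000 // 8, dtype=np.float64)
arg2 = np.ones(45_000_000 // 8, dtype=np.float64)
arg3, arg4, arg5 = 0.01, 10, "round-config"

def iter_function():
    ref = foo.remote(arg1, arg2, arg3, arg4, arg5)
    a = ray.get(ref)
    return a

# Memory usage appears to grow as the rounds progress.
for round_idx in range(50):
    iter_function()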
If we run ray memory at the end of a single iteration, we get the following output:
======== Object references status: 2023-04-11 11:24:29.142980 ========
Grouping by node address... Sorting by object size... Display all entries per group...
To record callsite information for each ObjectRef created, set env variable RAY_record_ref_creation_sites=1
--- Aggregate object store stats across all nodes ---
Plasma memory usage 86 MiB, 3 objects, 0.11% full, 0.11% needed
Objects consumed by Ray tasks: 582 MiB.
At that point, if we’re reading this right, we have 3 objects in the store occupying 86 MiB, but we don’t want them there. We’re not sure what these objects are; we believe they are copies of the large arguments we pass when we call the foo.remote function. We’re recording the callsite information, but we still can’t see where these objects are created.
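For reference, this is roughly how we enable callsite recording in the driver (a simplified sketch of our setup; exporting the variable in the shell before starting Ray should work as well):

import os

# Must be set before Ray starts so ObjectRef creation sites are recorded;
# this is the env var mentioned in the ray memory output itself.
os.environ["RAY_record_ref_creation_sites"] = "1"

import ray

ray.init()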
Thinking they could be the arguments we pass to the function, we also checked the ray memory output right after stepping inside the foo.remote function. Here’s that output:
======== Object references status: 2023-04-11 12:13:52.263102 ========
Grouping by node address... Sorting by object size... Display all entries per group...
--- Summary for node address: 10.68.186.140 ---
Mem Used by Objects Local References Pinned Used by task Captured in Objects Actor Handles
181.629489 MB 1, (0.0 MB) 3, (90.814736 MB) 3, (90.814753 MB) 0, (0.0 MB) 0, (0.0 MB)
--- Object references for node address: 10.68.186.140 ---
IP Address PID Type Call Site Status Size Reference Type Object Ref
10.68.186.140 108803 Driver (task call) | /home/f SCHEDULED ? LOCAL_REFERENCE c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000
austiar/SANDS/Projects
/KDN/src/FL/moon.py:pa
rallel_round_logic:70
| /home/faustiar/SANDS
/Projects/KDN/src/FL/b
ase_fl.py:excute_round
:74 | /home/faustiar/S
ANDS/Projects/KDN/src/
FL/base_fl.py:run:98
10.68.186.140 109059 Worker (deserialize task arg) - 0.150283 MB PINNED_IN_MEMORY 00ffffffffffffffffffffffffffffffffffffff0100000003000000
src.FL.utils.transfer
_ray
10.68.186.140 108803 Driver (task call) | /home/f FINISHED 0.150288 MB USED_BY_PENDING_TASK 00ffffffffffffffffffffffffffffffffffffff0100000003000000
austiar/SANDS/Projects
/KDN/src/FL/moon.py:pa
rallel_round_logic:70
| /home/faustiar/SANDS
/Projects/KDN/src/FL/b
ase_fl.py:excute_round
:74 | /home/faustiar/S
ANDS/Projects/KDN/src/
FL/base_fl.py:run:98
10.68.186.140 109059 Worker (deserialize task arg) - 45.332226 MB PINNED_IN_MEMORY 00ffffffffffffffffffffffffffffffffffffff0100000001000000
src.FL.utils.transfer
_ray
10.68.186.140 109059 Worker (deserialize task arg) - 45.332227 MB PINNED_IN_MEMORY 00ffffffffffffffffffffffffffffffffffffff0100000002000000
src.FL.utils.transfer
_ray
10.68.186.140 108803 Driver (task call) | /home/f FINISHED 45.332232 MB USED_BY_PENDING_TASK 00ffffffffffffffffffffffffffffffffffffff0100000001000000
austiar/SANDS/Projects
/KDN/src/FL/moon.py:pa
rallel_round_logic:70
| /home/faustiar/SANDS
/Projects/KDN/src/FL/b
ase_fl.py:excute_round
:74 | /home/faustiar/S
ANDS/Projects/KDN/src/
FL/base_fl.py:run:98
10.68.186.140 108803 Driver (task call) | /home/f FINISHED 45.332233 MB USED_BY_PENDING_TASK 00ffffffffffffffffffffffffffffffffffffff0100000002000000
austiar/SANDS/Projects
/KDN/src/FL/moon.py:pa
rallel_round_logic:70
| /home/faustiar/SANDS
/Projects/KDN/src/FL/b
ase_fl.py:excute_round
:74 | /home/faustiar/S
ANDS/Projects/KDN/src/
FL/base_fl.py:run:98
To record callsite information for each ObjectRef created, set env variable RAY_record_ref_creation_sites=1
--- Aggregate object store stats across all nodes ---
Plasma memory usage 86 MiB, 3 objects, 0.11% full, 0.11% needed
Objects consumed by Ray tasks: 43 MiB.
This output seems to agree that 3 objects are created when we call the foo.remote function, and that they sum to the same size we see after the iteration is complete. We believe these arguments could be causing the memory leak, but we’re not sure how to address it.
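One idea we’re considering, in case it helps frame the question (only a sketch, and we’re not sure it’s the right fix), is to ray.put the large arguments once up front and pass the resulting ObjectRefs into foo.remote, so that every iteration reuses the same object-store copies instead of shipping the arrays again:

import ray

# arg1..arg5 and foo are as in the earlier sketch. Put the two large
# arguments into the object store once and reuse the same ObjectRefs
# in every round; Ray resolves them to their values inside the task.
arg1_ref = ray.put(arg1)
arg2_ref = ray.put(arg2)

def iter_function():
    ref = foo.remote(arg1_ref, arg2_ref, arg3, arg4, arg5)
    return ray.get(ref)

for round_idx in range(50):
    iter_function()

We’re not sure whether that actually addresses the underlying growth, though.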
Any help would be appreciated.
Thank you.