Hello team,
I recently found a problem with either Ray or torch 2.1; the same code works with any torch 1.x.x. Memory usage keeps increasing. The growth is slow, but it matters a lot when the training time is long. I am not sure whether the issue comes from the torch side or from Ray.
The minimal reproduction is below. I also tried ray memory: the "Objects consumed by Ray tasks" figure keeps increasing, but the memory usage it reports does not match the real usage shown in htop, as Ray does not show the increase.
From my diagnosis, it looks like in each iteration of the loop in the main function, a portion of the objects returned by ray.get remains in memory?
Any help is appreciated, thanks a lot!
system
Linux Server8 6.2.0-37-generic #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 2 18:01:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
config
AMD 7950X
64G RAM
RTX 3090
import copy
import os
import random

import numpy as np
import ray
import torch
import wandb
from scipy.stats import ttest_rel
from torch.distributions import Categorical
from torch.utils.tensorboard import SummaryWriter

num_threads = 32


@ray.remote(num_cpus=1, num_gpus=1/num_threads)
class Runner(object):
    """Actor object to start running simulation on workers.
    Gradient computation is also executed on this object."""

    def __init__(self, ID, device='cpu'):
        self.ID = ID
        self.device = device
        self.localNet = torch.nn.Transformer(nhead=16, num_encoder_layers=12)

    def training(self, weights):
        self.localNet.load_state_dict(weights)
        buffer = torch.ones((10, 500, 500), dtype=torch.float32).to(self.device)
        return buffer


def main():
    device = torch.device('cuda')
    local_device = torch.device('cpu')
    ray.init()
    jobs = []
    global_network = torch.nn.Transformer(nhead=16, num_encoder_layers=12)
    global_optimizer = torch.optim.Adam(global_network.parameters(), lr=1e-5)
    lr_decay = torch.optim.lr_scheduler.StepLR(global_optimizer, step_size=2000, gamma=0.98)
    weights = global_network.to(local_device).state_dict()
    global_network.to(device)
    weights_memory = ray.put(weights)
    meta_agents = [Runner.remote(i) for i in range(num_threads)]
    for i in range(num_threads):
        jobs.append(meta_agents[i].training.remote(weights_memory))
    try:
        while True:
            # wait for all jobs of this round to complete (num_returns=num_threads)
            done_id, jobs = ray.wait(jobs, num_returns=num_threads)
            done_jobs = ray.get(done_id)
            for i in range(num_threads):
                jobs.append(meta_agents[i].training.remote(weights_memory))
            print("finish one epoch")
            del done_jobs, done_id
    except KeyboardInterrupt:
        for a in meta_agents:
            ray.kill(a)


if __name__ == '__main__':
    main()
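To put numbers on the growth seen in htop, the loop above could log the driver's resident memory once per epoch. The following is only a minimal sketch using the standard library; `rss_mib` and `RssTracker` are hypothetical helper names that are not part of Ray or torch, and the `/proc/self/status` path assumes Linux (as on this machine).

```python
import resource


def rss_mib():
    """Return this process's resident set size in MiB."""
    try:
        # Linux: the VmRSS field in /proc/self/status is reported in kB.
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024
    except OSError:
        pass
    # Fallback: peak (not current) RSS; ru_maxrss is in kB on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


class RssTracker:
    """Hypothetical helper: record RSS once per epoch and report the growth."""

    def __init__(self):
        self.samples = []

    def record(self):
        """Take one RSS sample and return it."""
        self.samples.append(rss_mib())
        return self.samples[-1]

    def growth_mib(self):
        """Difference between the latest and the first sample, in MiB."""
        if len(self.samples) < 2:
            return 0.0
        return self.samples[-1] - self.samples[0]


if __name__ == "__main__":
    tracker = RssTracker()
    for epoch in range(3):
        scratch = bytearray(8 * 1024 * 1024)  # stand-in for one epoch of work
        current = tracker.record()
        print(f"epoch {epoch}: rss={current:.1f} MiB, "
              f"growth={tracker.growth_mib():+.1f} MiB")
```

In the reproduction, `tracker.record()` would go next to the "finish one epoch" print. This measures only the driver process; a similar call inside `Runner.training` would be needed to see whether the actors grow as well.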
======== Object references status: 2023-11-28 00:32:23.272557 ========
Grouping by node address... Sorting by object size... Display allentries per group...
--- Summary for node address: 10.248.31.18 ---
Mem Used by Objects Local References Pinned Used by task Captured in Objects Actor Handles
572324025.0 B 1, (252309471.0 B) 32, (320014554.0 B) 0, (0.0 B) 0, (0.0 B) 64, (0.0 B)
--- Object references for node address: 10.248.31.18 ---
IP Address | PID | Type | Call Site | Status | Size | Reference Type | Object Ref
10.248.31.18 | 50612 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffffaf3906cfa0a64bf4244163b90100000001000000
10.248.31.18 | 50621 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffffeef96fda8c0053856cdef8cc0100000001000000
10.248.31.18 | 50618 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff5b9182bd2a2897921470c1da0100000001000000
10.248.31.18 | 50606 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffffcba275928548db4f1f60ff970100000001000000
10.248.31.18 | 50607 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff169e7813b90b6b7382f53f120100000001000000
10.248.31.18 | 50616 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffffe75c719c2c2d4e2a8b0c215e0100000001000000
10.248.31.18 | 50609 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffffceda45918555f0958da724960100000001000000
10.248.31.18 | 50611 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffffaf8a93903362e6dd56ff788f0100000001000000
10.248.31.18 | 50604 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff800eb71999a6a95107d0cacf0100000001000000
10.248.31.18 | 50580 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff43aa2ddc3acc63df846d79760100000001000000
10.248.31.18 | 50602 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff8efbbd20659ad7e4c9b55e750100000001000000
10.248.31.18 | 50598 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff0c14f519c9386842acbd9f4f0100000001000000
10.248.31.18 | 50594 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff5763641d00649efbf1bcef8c0100000001000000
10.248.31.18 | 50600 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff2f22a7e0b892d80c941466840100000001000000
10.248.31.18 | 50587 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff4b8660ce18ee40768e42a49a0100000001000000
10.248.31.18 | 50589 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff9ebd1af037d9ac3ae78424e20100000001000000
10.248.31.18 | 50614 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff53da1bf5107891ef2f1a18950100000001000000
10.248.31.18 | 50597 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff6aa054fde36cb1688c4236600100000001000000
10.248.31.18 | 50591 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff7c2896f7e8141550815e20a90100000001000000
10.248.31.18 | 50583 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff855d294f343d049612d07b440100000001000000
10.248.31.18 | 50581 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffffe5431f85f2d1ba5ee8f9b1d20100000001000000
10.248.31.18 | 50578 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff9abfda560219442e5a8832190100000001000000
10.248.31.18 | 50575 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffffa552ebbfcb61f2823b1e82f70100000001000000
10.248.31.18 | 50577 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff028046552c63096ddb3f39710100000001000000
10.248.31.18 | 50579 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff770a6b8c7c9aa3e869702f260100000001000000
10.248.31.18 | 50573 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffffc16665db2e4e24eed0dbe9340100000001000000
10.248.31.18 | 50623 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff1e7d1b2149919bf27619a91b0100000001000000
10.248.31.18 | 50593 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffff91e7626f62853f0d49f5fd930100000001000000
10.248.31.18 | 50624 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffffd1b6ca57aab9ef8c6d75f2fb0100000001000000
10.248.31.18 | 50585 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffffa332240affbf17a7e4f39c000100000001000000
10.248.31.18 | 50576 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffffa675cddf3fb844a80b329b380100000001000000
10.248.31.18 | 50574 | Worker | disabled | - | ? | ACTOR_HANDLE | ffffffffffffffffbdf4a5645ba401c01fd798ed0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffffaf8a93903362e6dd56ff788f0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff9abfda560219442e5a8832190100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff1e7d1b2149919bf27619a91b0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff855d294f343d049612d07b440100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffffcba275928548db4f1f60ff970100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffffe75c719c2c2d4e2a8b0c215e0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff4b8660ce18ee40768e42a49a0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff5763641d00649efbf1bcef8c0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff2f22a7e0b892d80c941466840100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffffd1b6ca57aab9ef8c6d75f2fb0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff8efbbd20659ad7e4c9b55e750100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff9ebd1af037d9ac3ae78424e20100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff91e7626f62853f0d49f5fd930100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff169e7813b90b6b7382f53f120100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff43aa2ddc3acc63df846d79760100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffffaf3906cfa0a64bf4244163b90100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff53da1bf5107891ef2f1a18950100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff800eb71999a6a95107d0cacf0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff6aa054fde36cb1688c4236600100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffffa675cddf3fb844a80b329b380100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffffceda45918555f0958da724960100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffffa552ebbfcb61f2823b1e82f70100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffffe5431f85f2d1ba5ee8f9b1d20100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffffeef96fda8c0053856cdef8cc0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff5b9182bd2a2897921470c1da0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffffbdf4a5645ba401c01fd798ed0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffffa332240affbf17a7e4f39c000100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff028046552c63096ddb3f39710100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffffc16665db2e4e24eed0dbe9340100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff0c14f519c9386842acbd9f4f0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff770a6b8c7c9aa3e869702f260100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | ? | ACTOR_HANDLE | ffffffffffffffff7c2896f7e8141550815e20a90100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000453.0 B | PINNED_IN_MEMORY | 55d883187917353aeef96fda8c0053856cdef8cc0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000453.0 B | PINNED_IN_MEMORY | 434b4fb1a8fd85cdaf3906cfa0a64bf4244163b90100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000453.0 B | PINNED_IN_MEMORY | 3d4de4ac5e50d22b43aa2ddc3acc63df846d79760100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | e3eb48a9d2572df42f22a7e0b892d80c941466840100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | e6ac4a962e7ac94f91e7626f62853f0d49f5fd930100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | a55ccc19cc20f1bc028046552c63096ddb3f39710100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | a5d07deeea48da8f169e7813b90b6b7382f53f120100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 25cb98df7a40cf1d53da1bf5107891ef2f1a18950100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 136995562f3fb3df855d294f343d049612d07b440100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | da7d203db1c7744b6aa054fde36cb1688c4236600100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 81d5d95e8d8226e6c16665db2e4e24eed0dbe9340100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 2a4e7cc14741d2d8bdf4a5645ba401c01fd798ed0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 590d9f6dd1eebb1e5b9182bd2a2897921470c1da0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 99a8fdd6b74e70647c2896f7e8141550815e20a90100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 3f9becfb57f03e84a675cddf3fb844a80b329b380100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 0f0b5fb44f317484a332240affbf17a7e4f39c000100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 1d90531bb1765df34b8660ce18ee40768e42a49a0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 834c36488f4f7b3f9ebd1af037d9ac3ae78424e20100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | f5477fe567481aebe75c719c2c2d4e2a8b0c215e0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 21e63b74ead222a0a552ebbfcb61f2823b1e82f70100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 9dd82f25405e2722770a6b8c7c9aa3e869702f260100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 9fac11437dba2d245763641d00649efbf1bcef8c0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | b7b0d30112dcb2d4cba275928548db4f1f60ff970100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 87058628a154b8aed1b6ca57aab9ef8c6d75f2fb0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | c11fdb4911cf29628efbbd20659ad7e4c9b55e750100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 3e425fa654abd7fc9abfda560219442e5a8832190100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 097f0b473087017eaf8a93903362e6dd56ff788f0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | d5f0cf47c4c3e093e5431f85f2d1ba5ee8f9b1d20100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | bb39a6f6324d677b1e7d1b2149919bf27619a91b0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | d5b0461782247a31800eb71999a6a95107d0cacf0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 56e09a8db21fea5cceda45918555f0958da724960100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 10000455.0 B | PINNED_IN_MEMORY | 1a8bd047988b60950c14f519c9386842acbd9f4f0100000001000000
10.248.31.18 | 49135 | Driver | disabled | FINISHED | 252309471.0 B | LOCAL_REFERENCE | 00ffffffffffffffffffffffffffffffffffffff0100000001e1f505
To record callsite information for each ObjectRef created, set env variable RAY_record_ref_creation_sites=1
--- Aggregate object store stats across all nodes ---
Plasma memory usage 545 MiB, 33 objects, 2.93% full, 2.93% needed
Objects consumed by Ray tasks: 5388878 MiB.
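As a sanity check on the output above, the summary line can be reproduced from the individual rows. This is only a sketch of the arithmetic; the sizes are copied from the rows above, and the reading that each pinned object is one returned buffer is an assumption based on its size.

```python
# Sizes copied from the ray memory rows above.
pinned_sizes = [10_000_453] * 3 + [10_000_455] * 29  # 32 PINNED_IN_MEMORY rows
local_ref_size = 252_309_471                          # the single LOCAL_REFERENCE row

# Raw payload of one returned buffer: torch.ones((10, 500, 500), dtype=float32).
buffer_payload = 10 * 500 * 500 * 4  # 4 bytes per float32 element

pinned_total = sum(pinned_sizes)
store_total = pinned_total + local_ref_size
# Bytes per pinned object beyond the raw tensor payload (assumed serialization overhead).
overhead = sorted({size - buffer_payload for size in pinned_sizes})

print(pinned_total)  # 320014554, matches "Pinned 32, (320014554.0 B)"
print(store_total)   # 572324025, matches "Mem Used by Objects 572324025.0 B"
print(overhead)      # [453, 455]
```

So the object store holds exactly one round of 32 results plus one large local reference, presumably the `ray.put(weights)` object. If that reading is right, the object store itself is not growing, which would be consistent with htop showing an increase that ray memory does not.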
How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.