hi,
I am using Ray 1.3 to scale up my RL code. After the code has been running for several tens of minutes, the RAM of my server is exhausted; memory usage keeps growing the whole time the code runs. A simplified version of the code looks like this:
import ray
import gc
import numpy as np
import time

ray.init(address="auto")


@ray.remote(memory=1 * 1024 * 1024 * 1024)
class Actor:
    def get_data(self):
        data = np.arange(6400).reshape((64, 100))
        d = {}
        d["data"] = data
        return d

    def get_episode_data(self):
        data = []
        for i in range(100 * 1):
            data.append(self.get_data())
        return data


@ray.remote(memory=2 * 1024 * 1024 * 1024)
class learner:
    def __init__(self):
        self.count = 0
        self.actor = Actor.remote()
        self.data = []

    def get_data(self):
        d = self.actor.get_episode_data.remote()
        return d

    def step(self):
        data = self.get_data()
        self.data.append(data)

    def clear(self):
        self.count += 1
        self.data = []


if __name__ == "__main__":
    lr = learner.remote()
    for i in range(2000000000):
        print(i, "-----------------------------")
        for j in range(20):
            lr.step.remote()
        lr.clear.remote()
        gc.collect()
In the simplified code, I use an actor to produce data and a learner to fetch it. When I run this code, RAM usage keeps growing, as shown below.
[screenshot: RAM usage at the beginning]
[screenshot: RAM usage a few minutes later]
[screenshot: RAM usage a few more minutes later]
From my observations, if I slow down the rate of data production, RAM usage does not keep growing. Maybe data is produced and fetched so fast that Python or Ray doesn't have enough time to collect the garbage.
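To show roughly what I mean by "slowing down" (just a sketch, reusing the Actor and learner classes from above): if the driver blocks on each call with ray.get instead of firing them all off, the submission rate is throttled to the speed of the learner, and the memory does not climb the same way:

# "slowed down" version of the driver loop: block on each remote call,
# so new work is only submitted after the previous call has finished
if __name__ == "__main__":
    lr = learner.remote()
    for i in range(2000000000):
        print(i, "-----------------------------")
        for j in range(20):
            ray.get(lr.step.remote())  # wait for the step to complete
        ray.get(lr.clear.remote())     # wait for the buffer to be cleared
        gc.collect()

But blocking like this makes everything serial, which defeats the purpose of using Ray in the first place.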
I still don't know what the real problem is. How can I stop the continuous growth of RAM usage?
Thanks!