Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I launched a long-running Ray cluster; my teammates use it through Ray Client and Ray Jobs.
I know the GCS server accumulates more and more metadata in memory as jobs are submitted, but I need to understand why the memory grows so significantly.
My situation: a 100-worker cluster, up for 3 days, 60+ jobs submitted, 1000+ actors (most of them finished), and gcs_server memory has now grown to nearly 15 GB.
import ray

ray.init(address='auto')

@ray.remote(num_cpus=0.01)
class MyActor:
    def ping(self):
        return 100

ACTOR_NUM = 100
actors = [MyActor.remote() for _ in range(ACTOR_NUM)]
for actor in actors:
    ray.get(actor.ping.remote())
print("Job Done")
Observe the resident memory (RES) of the gcs_server process: each submitted job grows it by about 15 MB, and it never goes back down until the process OOMs.
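For reference, this is roughly how the RES growth can be sampled without external tools. It is a sketch, Linux-only, standard library only; the gcs_server PID is an assumed input (e.g. obtained from `pgrep gcs_server`), and here it is demonstrated on the current process:

```python
import os

def rss_kib(pid: int) -> int:
    """Return VmRSS (resident set size) in KiB for the given PID, read from /proc."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                # line looks like: "VmRSS:    123456 kB"
                return int(line.split()[1])
    raise RuntimeError(f"VmRSS not found for pid {pid}")

# Demo on our own PID; in practice, pass the gcs_server PID and log the
# value after each job submission to see the per-job growth.
print(rss_kib(os.getpid()))
```

Logging this value once per submitted job is how the ~15 MB per-job increment above was observed.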