Hello!
I’ve been experimenting with Ray in a program that generates solutions with an NSGA-II algorithm. I wrote a custom evaluator for my use case and parallelized it with Ray. The run is sequential: I iterate over a number of categories (i.e. the GA runs at a lower level), and each run lasts 20 generations. I can complete a few iterations but never the full thing. It seems the objects are never fully evicted after an iteration finishes.
results = []
for i in range(len(chromosomes)):
    trial_idx = int(n_gen * population_size) + i
    offspring[i].trial_idx = trial_idx
    results.append(
        promos_in_chromosome.remote(
            self.predictor, chromosomes[i], business_logic_dict_id,
            pickle_manager_id, trial_idx, cached_id, delimiter_id,
            category_id, n_gen,
        )
    )
chromosomes, evaluation, updates_pgScanCaseRate, cache = zip(*ray.get(results))
# do updates
del results
del cache
del cached_id
del chromosomes, evaluation, updates_pgScanCaseRate
I think I’m removing most of the object refs, but memory still grows steadily.
Also, is there another way to share memory between workers without using actors? I am trying to collect a cache and reuse it across Ray tasks.
Traceback
File "/media/root/prophet/DevOps/bnlwe-da-p-80200-prophetball/prophetball/CalendarOptimizerv2/Optimizer/components.py", line 511, in evaluate_all
chromosomes,evaluation,updates_pgScanCaseRate,cache = zip(*ray.get([result for result in results]))
File "/root/anaconda3/envs/LarusTF/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/LarusTF/lib/python3.8/site-packages/ray/worker.py", line 1448, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::promos_in_chromosome() (pid=31734, ip=172.16.69.158)
File "python/ray/_raylet.pyx", line 534, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 535, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1600, in ray._raylet.CoreWorker.store_task_outputs
File "python/ray/_raylet.pyx", line 151, in ray._raylet.check_status
ray.exceptions.ObjectStoreFullError: Failed to put object 581d4440283ca102ffffffffffffffffffffffff0100000001000000 in object store because it is full. Object size is 30187139 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.
OS: Linux (Ubuntu 20.04)
Ray: 2.0
RAM: 480 GB
CPUs: 24
When using ray.put, does the object get overwritten? Say, for example, I have a line like
cache_id = ray.put(cache)
that gets called several times. Will memory grow even if it’s the same object and the same size each time? And if it’s not the exact same object, will the old copy be replaced?