Hello!
I’ve been experimenting with Ray in a program that generates solutions with an NSGA-II algorithm. I wrote a custom evaluator for my use case and parallelized it with Ray. The run is sequential: I iterate over a number of categories (i.e. the GA runs at a lower level), and each run lasts 20 generations. I can complete a few iterations but never the full thing. It seems the objects are never fully evicted after an iteration finishes.
results = []
for i in range(len(chromosomes)):
    trial_idx = int(n_gen * population_size) + i
    offspring[i].trial_idx = trial_idx
    results.append(
        promos_in_chromosome.remote(
            self.predictor, chromosomes[i], business_logic_dict_id,
            pickle_manager_id, trial_idx, cached_id, delimiter_id,
            category_id, n_gen,
        )
    )
chromosomes, evaluation, updates_pgScanCaseRate, cache = zip(*ray.get(results))
# do updates
del results
del cache
del cached_id
del chromosomes, evaluation, updates_pgScanCaseRate
I think I’m removing most of the object refs, but memory still grows steadily.
Also, is there another way to share memory between workers without using actors? I am trying to collect a cache and reuse it across Ray tasks.
Traceback
File "/media/root/prophet/DevOps/bnlwe-da-p-80200-prophetball/prophetball/CalendarOptimizerv2/Optimizer/components.py", line 511, in evaluate_all
chromosomes,evaluation,updates_pgScanCaseRate,cache = zip(*ray.get([result for result in results]))
File "/root/anaconda3/envs/LarusTF/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/LarusTF/lib/python3.8/site-packages/ray/worker.py", line 1448, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::promos_in_chromosome() (pid=31734, ip=172.16.69.158)
File "python/ray/_raylet.pyx", line 534, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 535, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1600, in ray._raylet.CoreWorker.store_task_outputs
File "python/ray/_raylet.pyx", line 151, in ray._raylet.check_status
ray.exceptions.ObjectStoreFullError: Failed to put object 581d4440283ca102ffffffffffffffffffffffff0100000001000000 in object store because it is full. Object size is 30187139 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.
OS: Linux (Ubuntu 20.04)
Ray: 2.0
RAM: 480 GB
CPUs: 24
When using ray.put, does the object get overwritten? Say, for example, I have a line like
cache_id = ray.put(cache)
that gets called several times. Will memory grow even if it’s the same object and the same size each time? And if it’s not the exact same object, will the old copy be replaced?