Ray Tune: memory leak or bad code?

Kacha · April 10, 2021, 4:51pm

Hi everyone

I am trying to use Ray Tune for the optimization of some parameters to a function.
Packages and versions:

Windows
Ray 1.3
nevergrad 0.4.3.post2

I use a config with 8 parameters to optimize for and generate the search space with tune.choice(), mainly, because the optimizer I use, Nevergrad, doesn’t support quantization.

Following a small MVE:

import gc
import os

import nevergrad as ng
import numpy as np
import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.nevergrad import NevergradSearch

os.environ[‘TUNE_GLOBAL_CHECKPOINT_S’] = ‘120’

def function(config, data=None):
some_collection = []
for i in data:
if i[0] / config[‘a’] > 1:
d = {‘z’: config[‘a’] * config[‘b’], ‘y’: config[‘c’] * 1000, ‘x’: config[‘d’] * i[1], ‘w’: config[‘e’] * i[2], ‘v’: config[‘f’] * i[0], ‘u’: config[‘g’] / i[0], ‘t’: config[‘h’] / i[1]}
some_collection.append(d)
objective = 0
try:
for k, v in some_collection[-1].items():
objective += v
except:
pass
tune.report(objective=objective)
del some_collection
gc.collect()
return

def tuning(data):
parameter_list = [‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, ‘g’, ‘h’]
config = {}
for i in parameter_list:
config[i] = tune.choice(np.arange(0.1, 10, 0.001))

iterations = 1000
num_cpus = 20
n_particles = 24
phi1 = 1.4962
phi2 = 1.4962
omega = 0.7298

ray.init(num_cpus=num_cpus)
pso = ng.optimizers.ConfiguredPSO(transform='identity', popsize=n_particles, omega=omega, phip=phi1, phig=phi2)
algo = NevergradSearch(optimizer=pso)
algo = ConcurrencyLimiter(algo, max_concurrent=num_cpus)
scheduler = AsyncHyperBandScheduler()

analysis = tune.run(tune.with_parameters(function, data=data), metric='objective', mode='max', name='search', search_alg=algo, scheduler=scheduler, num_samples=iterations, config=config, verbose=1, reuse_actors=True, local_dir='reports')

ray.shutdown()
print('Best candidate found were: ', analysis.best_config)

if name == ‘main’:
data = np.random.rand(100000, 3)
tuning(data)

What I experience is the following:

The memory usage of one process is growing bigger and bigger while the workers roughly maintain their memory usage
The temporary checkpoint file is growing ever larger and causing longer saving times

ray memory doesn’t show anything besides the worker processes and the initial data.

I am now wondering, what is consuming all that memory? Are the some_collection objects stored and never released? From what I have read, del should definitely remove them. Does something else in the code cause this? Does it have something to do with the search space or the searcher? Or the checkpointing?

I’m glad for any kind of advice!

Kind regards
Kacha

kai · April 14, 2021, 9:15pm

Hi Kacha,

it is very hard to tell because the code in unreadable in the current format.

Can you paste it again using the proper formatting? You can just use

```test```

and add a new line around the three apostrophes to get correctly formatted code. Alternatively you can use pastebin or github gists to paste your code.

Thanks!

Topic		Replies	Views
Ray using so much memory I cannot even start the tuning Ray Tune	5	2218	April 24, 2023
Ray Out of Memory Issue Ray Tune	1	188	April 30, 2024
[Tune] Ray tune terminates after OSError: [Errno 28] No space left on device Ray Tune	2	2558	May 7, 2021
Simple hello_world example crashes badly Ray Core	6	390	December 29, 2023
ray.tune.Experiment.from_json() is giving error	0	316	May 16, 2023

Ray Tune: memory leak or bad code?

Related topics