Ray Tune: memory leak or bad code?

Hi everyone

I am trying to use Ray Tune to optimize some parameters of a function.
Environment:

  • Windows
  • Ray 1.3
  • nevergrad 0.4.3.post2

I use a config with 8 parameters to optimize and generate the search space with tune.choice(), mainly because the optimizer I use, Nevergrad, doesn't support quantization.

Here is a small MVE:

```python
import gc
import os

import nevergrad as ng
import numpy as np
import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.nevergrad import NevergradSearch

# write the global experiment checkpoint at most every 120 seconds
os.environ['TUNE_GLOBAL_CHECKPOINT_S'] = '120'


def function(config, data=None):
    # collect one dict per data row that passes the threshold
    some_collection = []
    for i in data:
        if i[0] / config['a'] > 1:
            d = {
                'z': config['a'] * config['b'],
                'y': config['c'] * 1000,
                'x': config['d'] * i[1],
                'w': config['e'] * i[2],
                'v': config['f'] * i[0],
                'u': config['g'] / i[0],
                't': config['h'] / i[1],
            }
            some_collection.append(d)

    # objective: sum of the values of the last collected dict (0 if none passed)
    objective = 0
    try:
        for k, v in some_collection[-1].items():
            objective += v
    except IndexError:
        pass

    tune.report(objective=objective)
    del some_collection
    gc.collect()
    return


def tuning(data):
    # 8 parameters, each drawn from a discretized range via tune.choice
    parameter_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
    config = {}
    for i in parameter_list:
        config[i] = tune.choice(np.arange(0.1, 10, 0.001))

    iterations = 1000
    num_cpus = 20
    n_particles = 24
    phi1 = 1.4962
    phi2 = 1.4962
    omega = 0.7298

    ray.init(num_cpus=num_cpus)

    # PSO via Nevergrad, limited to num_cpus concurrent trials
    pso = ng.optimizers.ConfiguredPSO(
        transform='identity',
        popsize=n_particles,
        omega=omega,
        phip=phi1,
        phig=phi2,
    )
    algo = NevergradSearch(optimizer=pso)
    algo = ConcurrencyLimiter(algo, max_concurrent=num_cpus)
    scheduler = AsyncHyperBandScheduler()

    analysis = tune.run(
        tune.with_parameters(function, data=data),
        metric='objective',
        mode='max',
        name='search',
        search_alg=algo,
        scheduler=scheduler,
        num_samples=iterations,
        config=config,
        verbose=1,
        reuse_actors=True,
        local_dir='reports',
    )

    ray.shutdown()
    print('Best candidate found was:', analysis.best_config)


if __name__ == '__main__':
    data = np.random.rand(100000, 3)
    tuning(data)
```

What I experience is the following:

  • The memory usage of one process keeps growing and growing, while the worker processes roughly maintain their memory usage (see the monitoring sketch below)
  • The temporary checkpoint file keeps getting larger, which leads to ever longer saving times
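
For reference, this is roughly how I watch the growing process; a minimal sketch assuming psutil is available and that the PID (a placeholder here) is the one of the growing process taken from the task manager:

```python
import time

import psutil

# PID of the process whose memory keeps growing
# (placeholder value; taken from the task manager)
GROWING_PID = 12345

proc = psutil.Process(GROWING_PID)
for _ in range(60):
    rss_mib = proc.memory_info().rss / 1024 ** 2  # resident set size in MiB
    print(f'RSS: {rss_mib:.1f} MiB')
    time.sleep(10)  # sample every 10 seconds
```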

`ray memory` doesn't show anything besides the worker processes and the initial data.

I am now wondering: what is consuming all that memory? Are the some_collection objects kept somewhere and never released? From what I have read, del should definitely remove them. Is something else in the code causing this? Does it have to do with the search space or the searcher? Or with the checkpointing?
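
To sanity-check the del / gc.collect() assumption outside of Ray, a plain-Python experiment like the following (using tracemalloc; the list size is just an arbitrary stand-in for some_collection) should show whether such a list is actually freed:

```python
import gc
import tracemalloc

tracemalloc.start()

# build a throwaway list roughly comparable to some_collection
collection = [{'z': i, 'y': i, 'x': i} for i in range(100000)]
before, _ = tracemalloc.get_traced_memory()  # currently traced bytes

del collection
gc.collect()
after, _ = tracemalloc.get_traced_memory()

print(f'before del: {before / 1024 ** 2:.1f} MiB, after del: {after / 1024 ** 2:.1f} MiB')
```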

I’m glad for any kind of advice!

Kind regards
Kacha

Hi Kacha,

it is very hard to tell because the code is unreadable in its current format.

Can you paste it again using the proper formatting? You can just use

```test```

and add a new line around the three backticks to get correctly formatted code. Alternatively, you can use Pastebin or GitHub Gists to share your code.

Thanks!