Hello,
I’m trying to run a simple Ray Tune pipeline for hyperparameter optimization, all on one node for now. The model itself is simple and CPU-only.
When I run with a small enough number of samples that my node’s RAM can handle all of the trials at once (e.g. 2), everything runs to completion. However, when I increase the number of samples past the point where they can all fit in memory simultaneously, the trials don’t queue like I would expect. Instead they all start at once, most of them crash with “OSError: [Errno 12] Cannot allocate memory”, and the ray status looks something like this:
Here’s what my tune.run() code looks like:
import ray
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler

# Early stopping based on valid_recall, reported each training iteration
asha_scheduler = ASHAScheduler(time_attr='training_iteration', metric='valid_recall', mode='max')
reporter = CLIReporter(metric_columns=["valid_recall"])

ray.init(dashboard_port=6007)

result = tune.run(
    train,
    local_dir=config['out_directory'],
    resources_per_trial={"cpu": 1, "gpu": 0},  # each trial reserves 1 CPU and no GPU
    config=config,
    num_samples=15,
    progress_reporter=reporter,
    scheduler=asha_scheduler,
    verbose=3,
    queue_trials=True,
)
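(For context, train is a function trainable that reports valid_recall each training iteration. I’ve left out the data loading and model code, so this is only a simplified sketch of its shape; evaluate_model and the fixed epoch count here are just stand-ins for my real code:)

def train(config):
    for epoch in range(10):  # real code uses a configurable epoch count
        # ... fit the CPU-only model for one epoch ...
        recall = evaluate_model(config)  # placeholder for my actual evaluation step
        tune.report(valid_recall=recall)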
config is a dictionary containing both the training parameters that don’t change and the ones for Ray Tune to search over. I’ve tried with and without queue_trials. Sorry if I’m missing something basic here, but is there a way to get all of the trials to run, launching only as many at a time as will fit within the node’s memory?
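For instance, would something like the following be the right approach? I’m not sure whether the "memory" key in resources_per_trial is actually honored as a scheduling constraint (so this is just a guess on my part), but the idea would be to reserve roughly the per-trial RAM so that Ray only schedules as many trials as fit on the node at once:

# Guess: reserve ~4 GB of RAM per trial in addition to 1 CPU, so trials queue
# once the node's memory is fully reserved rather than all starting together.
result = tune.run(
    train,
    local_dir=config['out_directory'],
    resources_per_trial={"cpu": 1, "gpu": 0, "memory": 4 * 1024 ** 3},
    config=config,
    num_samples=15,
    progress_reporter=reporter,
    scheduler=asha_scheduler,
    verbose=3,
    queue_trials=True,
)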
Thank you very much